GENERAL NONEXACT ORACLE INEQUALITIES FOR CLASSES
CNRS, Université Paris-Est Marne-la-Vallée and Technion,
Israel Institute of Technology

We show that empirical risk minimization procedures and regu- 
larized empirical risk minimization procedures satisfy nonexact oracle 
inequalities in an unbounded framework, under the assumption that 
the class has a subexponential envelope function. The main novelty,
in addition to the setup being free of boundedness assumptions, is that those
inequalities can yield fast rates even in situations in which exact oracle
inequalities only hold with slower rates.

We apply these results to show that procedures based on $\ell_1$ and
nuclear norm regularization functions satisfy oracle inequalities with
a residual term that decreases like $1/n$ for every $L_q$-loss function
($q \ge 2$), while only assuming that the tail behavior of the input and
output variables is well behaved. In particular, no RIP type of assumption
or "incoherence condition" is needed to obtain fast residual terms in those
setups. We also apply these results to the problems of convex aggregation
and model selection.

1. Introduction and main results. Let $\mathcal{Z}$ be a space endowed with a probability
measure $P$, and let $Z$ and $Z_1,\dots,Z_n$ be $n+1$ independent random variables with
values in $\mathcal{Z}$, distributed according to $P$; from the statistical point of view,
$\mathcal{D} = (Z_1,\dots,Z_n)$ is the set of given data. Let $\ell$ be a loss function which
associates a real number $\ell(f,z)$ to any real-valued measurable function $f$ defined
on $\mathcal{Z}$ and any point $z \in \mathcal{Z}$. Denote by $\ell_f$ the loss function $\ell(f,\cdot)$
associated with $f$ and set $R(f) = \mathbb{E}\ell_f(Z)$ to be the associated risk. The risk of
any statistic $\hat f_n(\cdot) = \hat f_n(\cdot,\mathcal{D}) : \mathcal{Z} \to \mathbb{R}$ is defined by
$R(\hat f_n) = \mathbb{E}[\ell_{\hat f_n}(Z) \mid \mathcal{D}]$.

Let $F$ be a class (usually called the model) of real-valued measurable
functions defined on $\mathcal{Z}$. In learning theory, one wants to assume as little as
possible on the class $F$, or on the measure $P$. The aim is to use the data to
construct learning algorithms whose risk is as close as possible to $\inf_{f\in F} R(f)$
(and when this infimum is attained by a function $f^*_F$ in $F$, this element is
called an oracle). Hence, one would like to construct procedures $\hat f_n$ such
that, for some $\epsilon \ge 0$, with high probability,

(1.1) $R(\hat f_n) \le (1+\epsilon) \inf_{f\in F} R(f) + r_n(F).$

The role of the residual term (or rate) $r_n(F)$ is to capture the "complexity"
of the problem, and the hope is to make it as small as possible.

When $r_n(F)$ tends to zero as $n$ tends to infinity, inequality (1.1) is called
an oracle inequality. When $\epsilon = 0$, we say that $\hat f_n$ satisfies an exact oracle
inequality (the term sharp oracle inequality has also been used) and when
$\epsilon > 0$ it satisfies a nonexact oracle inequality. Note that the terminology "risk
bounds" has also been used for (1.1) in the literature.

A natural algorithm in this setup is the empirical risk minimization procedure
(ERM) (terminology due to [43]), in which the empirical risk functional

$f \mapsto R_n(f) = \frac{1}{n}\sum_{i=1}^n \ell_f(Z_i)$

is minimized and produces $\hat f_n^{ERM} \in \operatorname{Argmin}_{f\in F} R_n(f)$. Note that when $R_n(\cdot)$
does not achieve its infimum over $F$ or if the minimizer is not unique, we
define $\hat f_n^{ERM}$ to be an element in $F$ for which $R_n(\hat f_n^{ERM}) \le \inf_{f\in F} R_n(f) + 1/n$.
This algorithm has been extensively studied, and we will compare our first
result to the ones of [4, 12, 24].
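
The ERM procedure described above can be sketched for a finite model and the quadratic loss. This is only an illustration: the constant predictors, the simulated data and the helper names (`empirical_risk`, `erm`, `make_constant`) are assumptions made here, not objects from the paper.

```python
import random

def empirical_risk(f, data):
    """Empirical quadratic risk R_n(f) = (1/n) * sum_i (y_i - f(x_i))^2."""
    return sum((y - f(x)) ** 2 for x, y in data) / len(data)

def erm(model, data):
    """Return an element of the finite model F that minimizes R_n over F."""
    return min(model, key=lambda f: empirical_risk(f, data))

def make_constant(c):
    """Hypothetical toy predictor f_c(x) = c, tagged with its value for inspection."""
    def f(x):
        return c
    f.c = c
    return f

# Hypothetical finite model of cardinality M = 4 and simulated data Y = 0.5 + noise.
model = [make_constant(c) for c in (-1.0, 0.0, 0.5, 1.0)]
rng = random.Random(0)
data = [(x, 0.5 + 0.1 * rng.gauss(0.0, 1.0)) for x in range(200)]

f_hat = erm(model, data)
print(f_hat.c, empirical_risk(f_hat, data))
```

For a finite model the minimum is always attained, so the $1/n$ slack in the definition of the ERM is not needed in this toy case.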

One motivation in obtaining nonexact oracle inequalities [equation (1.1)
for $\epsilon \ne 0$] is the observation that in many situations, one can obtain such an
inequality for the ERM procedure with a residual term $r_n(F)$ of the order
of $1/n$, while the best residual term achievable by ERM in an exact oracle
inequality [equation (1.1) for $\epsilon = 0$] will only be of the order of $1/\sqrt{n}$ for the
same problem. For example, consider the simple case of a finite model $F$
of cardinality $M$ and the bounded regression model with the quadratic loss
function [i.e., $Z = (X,Y) \in \mathcal{X}\times\mathbb{R}$ with $|Y|, \max_{f\in F}|f(X)| \le C$ for some
absolute constant $C$ and $\ell(f,(X,Y)) = (Y - f(X))^2$]. It can be verified
that for every $x > 0$, with probability greater than $1 - 8\exp(-x)$, $\hat f_n^{ERM}$
satisfies a nonexact oracle inequality with a residual term proportional to
$(x + \log M)/(\epsilon n)$. On the other hand, it is known [19, 28, 44] that in the
same setup, there are finite models for which, with probability greater than
a positive constant, $\hat f_n^{ERM}$ cannot satisfy an exact oracle inequality with
a residual term better than $c_0\sqrt{(\log M)/n}$. Thus, it is possible to establish
two optimal oracle inequalities [i.e., oracle inequalities with a nonimprovable
residual term $r_n(F)$ up to some multiplying constant] for the same procedure
with two very different residual terms: one being the square of the
other one. We will see below that the same phenomenon occurs in the
classification framework for VC classes. Thus our main goal here is to present
a general framework for nonexact oracle inequalities for ERM and RERM
(regularized ERM), and show that they lead to fast rates in cases when the
best known exact oracle inequalities have slow rates.
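
To get a feeling for the gap between the two residual terms in the finite-model example, the two orders of magnitude can be evaluated numerically. The sketch below drops all absolute constants (such as $c_0$), so the numbers only illustrate the scaling in $n$ and $M$; the function names and the chosen values of $n$, $M$, $x$ and $\epsilon$ are assumptions made for illustration.

```python
import math

def fast_residual(n, M, x, eps):
    """Nonexact oracle inequality, up to constants: (x + log M) / (eps * n)."""
    return (x + math.log(M)) / (eps * n)

def slow_residual(n, M):
    """Exact oracle inequality lower bound, up to constants: sqrt(log M / n)."""
    return math.sqrt(math.log(M) / n)

n, M, x, eps = 10_000, 100, 1.0, 0.25
fast = fast_residual(n, M, x, eps)
slow = slow_residual(n, M)
print(f"fast rate ~ {fast:.5f}, slow rate ~ {slow:.5f}")
```

With these values the fast residual is roughly ten times smaller than the slow one, and the ratio keeps growing like $\sqrt{n}$ when $n$ increases with $M$, $x$ and $\epsilon$ fixed.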

Although the improved rates are significant, it is clear that exact inequalities
are more "valuable" from the statistical point of view. For example,
consider the regression model with the quadratic loss. An exact oracle
inequality on the prediction risk [equation (1.1) for $\epsilon = 0$] implies another
exact oracle inequality, but for the estimation risk:

$\|\hat f_n^{ERM} - f^*\|_{L_2}^2 \le \inf_{f\in F}\|f - f^*\|_{L_2}^2 + r_n(F),$

where $f^*$ is the regression function of $Y$ given $X$, and $\|\cdot\|_{L_2}$ is the $L_2$-norm
with respect to the marginal distribution of $X$.

In other words, exact oracle inequalities for the prediction risk $R(\cdot)$ provide
both prediction and estimation results (prediction of the output $Y$ and
estimation of the regression function $f^*$) whereas nonexact oracle inequalities
provide only prediction results.

Of course, nonexact inequalities are very useful when it suffices to compare
the risk $R(\hat f_n)$ with $(1+\epsilon)\inf_{f\in F} R(f)$; and the aim of this note is to show
that the residual term can be dramatically improved in such cases.

1.1. Empirical risk minimization. The first result of this note is a nonexact
oracle inequality for the ERM procedure. To state this result, we need
the following notation. Let $G$ be a class of real-valued functions defined on $\mathcal{Z}$.
An important part of our analysis relies on the behavior of the supremum
of the empirical process indexed by $G$

(1.2) $\|P - P_n\|_G = \sup_{g\in G}|(P - P_n)(g)|,$

where for every $g \in G$, we set $Pg = \mathbb{E}g(Z)$ and $P_n g = n^{-1}\sum_{i=1}^n g(Z_i)$. Recall
that for every $\alpha \ge 1$, the $\psi_\alpha$ norm of $g(Z)$ is

$\|g(Z)\|_{\psi_\alpha} = \inf\{c > 0 : \mathbb{E}\exp(|g(Z)|^\alpha / c^\alpha) \le 2\}.$

We will control the supremum (1.2) using the quantities

$\sigma(G) = \sup_{g\in G}\sqrt{Pg^2}$ and $b_n(G) = \big\|\max_{1\le i\le n}\sup_{g\in G}|g(Z_i)|\big\|_{\psi_1}.$

Note that for a bounded class $G$, one has $b_n(G) \lesssim \sup_{g\in G}\|g\|_\infty$ and in
the sub-exponential case, $b_n(G) \lesssim (\log en)\,\|\sup_{g\in G}|g|\|_{\psi_1}$ (this follows from
Pisier's inequality); cf. Lemma 2.2.2 in [42]. Throughout this note we will also
use the notation $b_n(g) = \|\max_{1\le i\le n}|g(Z_i)|\|_{\psi_1}$ and for any pseudo-norm $\|\cdot\|$
on $L_2(P)$, we will denote by $\operatorname{diam}(G, \|\cdot\|) = \sup_{g\in G}\|g\|$ the diameter of $G$
with respect to this norm.
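
The definition of the $\psi_\alpha$ norm can be made concrete by computing it for the empirical measure of a sample. The following is a minimal sketch, assuming one replaces the expectation by a sample mean and solves for the smallest admissible $c$ by bisection; the function name `empirical_orlicz_norm` and the tolerance are choices made here, and the result is only a plug-in proxy for the population norm.

```python
import math

def empirical_orlicz_norm(values, alpha=1.0, tol=1e-10):
    """Smallest c > 0 with mean(exp(|v|^alpha / c^alpha)) <= 2, found by bisection."""
    abs_vals = [abs(v) for v in values]
    if max(abs_vals) == 0.0:
        return 0.0

    def constraint_ok(c):
        try:
            mean = sum(math.exp((a / c) ** alpha) for a in abs_vals) / len(abs_vals)
        except OverflowError:
            return False  # c is far too small: the exponential moment blows up
        return mean <= 2.0

    lo, hi = 0.0, max(abs_vals)
    while not constraint_ok(hi):  # enlarge hi until the constraint is satisfied
        hi *= 2.0
    while hi - lo > tol * hi:     # the constraint is monotone in c, so bisect
        mid = 0.5 * (lo + hi)
        if constraint_ok(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Sanity check on a constant sample |g| = 1: exp(1/c) <= 2 gives c = 1/log(2).
print(empirical_orlicz_norm([1.0] * 5, alpha=1.0), 1.0 / math.log(2.0))
```

The constant sample also shows why a bound such as $b_n(G) \lesssim \sup_{g\in G}\|g\|_\infty$ holds only up to an absolute constant: a function bounded by 1 has $\psi_1$ norm $1/\log 2 > 1$.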

Observe that the desired bound depends on the $\psi_1$ behavior of the envelope
function of the class, $\sup_{g\in G}|g(Z)|$, and as noted above, this extends
the "classical" framework of a uniformly bounded class in $L_\infty$. Although
this extension seems minor at first, the examples we will present show that
the assumption is not very restrictive and allows one to deal with LASSO-type
situations, in which the indexing class is very small, something which
is impossible under the $L_\infty$ assumption. On the other hand, it should be
emphasized that this is not a step towards an unbounded learning theory.
For such results, the analogous assumption should be that the class has
a bounded diameter in $\psi_1$, which is, of course, a much weaker assumption
than a $\psi_1$ envelope function and requires different methods; see, for example,
[27, 34].

To obtain the required bound, we will study empirical processes indexed
by sets associated with $G$, namely, the star-shaped hull of $G$ around zero
and the localized subsets for different levels $\lambda > 0$, defined by

$V(G) = \{\theta g : 0 \le \theta \le 1,\ g \in G\}$ and $V(G)_\lambda = \{h \in V(G) : Ph \le \lambda\}.$

Given a model $F$ and a loss function $\ell$, consider the loss class
$\ell_F = \{\ell_f : f \in F\}$ and the excess loss class $\mathcal{L}_F = \{\ell_f - \ell_{f^*_F} : f \in F\}$.
We will assume that an oracle $f^*_F$ exists in $F$, and from here on set
$\mathcal{L}_f = \ell_f - \ell_{f^*_F}$.



Theorem A. There exists an absolute constant $c_0 > 0$ for which the
following holds. Let $F$ be a class of functions and assume that there exists
$B_n \ge 0$ such that for every $f \in F$, $P\ell_f^2 \le B_n P\ell_f + B_n^2/n$. Let $0 < \epsilon < 1/2$,
set $\lambda^*_\epsilon > 0$ for which

$\mathbb{E}\|P_n - P\|_{V(\ell_F)_{\lambda^*_\epsilon}} \le (\epsilon/4)\lambda^*_\epsilon$

and put $\rho_n$ an increasing function satisfying that for every $x > 0$,

$\rho_n(x) \ge \max\Big(\lambda^*_\epsilon,\ c_0\frac{(b_n(\ell_F) + B_n/\epsilon)\,x}{n\epsilon}\Big).$

Then, for every $x > 0$, with probability greater than $1 - 8\exp(-x)$,

$R(\hat f_n^{ERM}) \le (1 + 3\epsilon)\inf_{f\in F} R(f) + \rho_n(x).$

Remark 1.1. Although the formulation of Theorem A requires that for
every $\ell \in \ell_F$, $P\ell^2 \le B_n P\ell + B_n^2/n$, we will show that if $\ell$ is nonnegative, this
condition is trivially satisfied for $B_n \sim \operatorname{diam}(\ell_F, \psi_1)\log(n)$.
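
The claim of Remark 1.1 can be made plausible by a short truncation computation. The following is only a sketch under stated assumptions, not necessarily the argument used in the paper: it considers a single nonnegative function $\ell$ with $\|\ell(Z)\|_{\psi_1} \le D$, takes $n \ge 3$ (so that $\log n \ge 1$), and uses the truncation level $T = D\log n$, which is a choice made here for illustration.

```latex
% Tail bound from the psi_1 norm (Markov's inequality applied to exp(\ell/D)):
\[
P(\ell > t) \le 2\exp(-t/D), \qquad t \ge 0 .
\]
% Split the second moment at the truncation level T = D\log n (recall \ell \ge 0):
\[
P\ell^2 = \mathbb{E}\,\ell^2\mathbf{1}_{\{\ell\le T\}} + \mathbb{E}\,\ell^2\mathbf{1}_{\{\ell> T\}}
\le T\,P\ell + T^2 P(\ell>T) + \int_T^\infty 2t\,P(\ell>t)\,dt .
\]
% Insert the tail bound, with \int_T^\infty t e^{-t/D}\,dt = D(T+D)e^{-T/D} and e^{-T/D} = 1/n:
\[
P\ell^2 \le T\,P\ell + \frac{2T^2 + 4D(T+D)}{n}
\le T\,P\ell + \frac{10\,T^2}{n},
\]
% because D \le T when \log n \ge 1. Hence, with B_n = \sqrt{10}\,D\log n \ge T,
\[
P\ell^2 \le B_n\,P\ell + \frac{B_n^2}{n}.
\]
```

Taking $D = \operatorname{diam}(\ell_F, \psi_1)$ makes the same $B_n$ work for every function in the loss class, which matches the order $\operatorname{diam}(\ell_F,\psi_1)\log(n)$ stated in the remark.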






Unfortunately, this type of condition is far from being trivially satisfied
for the excess loss class $\mathcal{L}_F = \{\ell_f - \ell_{f^*_F} : f \in F\}$, which is one of the major
differences between exact and nonexact oracle inequalities. Indeed, the
Bernstein condition, that for every $f \in F$, $\mathbb{E}\mathcal{L}_f^2 \le B\,\mathbb{E}\mathcal{L}_f$ (see [4] or Section 6
below), used in [4, 12, 24] to obtain exact oracle inequalities with fast rates
(rates of the order of $1/n$), depends on the geometry of the problem [29, 30]
and may not be true in general. Theorem A is similar in nature to
Corollary 2.9 of [4] and a detailed comparison between the two results can be
found in Section 6.

Theorem A is also similar in nature to Theorem 2 in [24], which reads as follows.

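Theorem 1.2. Let $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ be a nondecreasing, continuous function,
for which $\phi(1) \ge 1$ and $x \mapsto \phi(x)/x$ is nonincreasing. Set $F$ to be a class
of functions where there is some $0 < \beta \le 1$ such that $\mathbb{E}\mathcal{L}_f^2 \le B(\mathbb{E}\mathcal{L}_f)^\beta$ and
$\|\ell_f\|_\infty \le 1$. If

$\phi(\lambda) \ge \sqrt{n}\,\mathbb{E}\sup_{f,g\in F,\ P(\ell_f-\ell_g)^2\le\lambda^2}(P - P_n)(\ell_f - \ell_g)$

for any $\lambda$ satisfying $\phi(\lambda) \le \sqrt{n}\lambda^2$, and $\epsilon_*$ is the unique solution of the equation
$\sqrt{n}\epsilon_*^2 = \phi(\sqrt{B}\epsilon_*^\beta)$, then for every $x \ge 1$, with probability greater than $1 - \exp(-x)$,

$R(\hat f_n^{ERM}) \le \inf_{f\in F} R(f) + c_0 x\epsilon_*^2.$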

One of the applications of the above theorem in learning theory is for the
loss function $\ell_f(x,y) = \mathbf{1}_{f(x)\ne y}$. It leads to an exact oracle inequality for the
ERM procedure, performed in a class $F$ of VC dimension $V \le n$ (see [24] for
more details), and with a residual term of the order of
$(V\log(enB^{1/\beta}/V)/n)^{1/(2-\beta)}$.

In comparison, in the same situation, for every $f \in F$, $\mathbb{E}\ell_f^2 \le \mathbb{E}\ell_f$. Therefore,
it follows from Theorem A, the argument used to obtain equation (29)
in [24] (or Example 3 in [12]) and the peeling argument which will be presented
in (2.5) below, that for every $x \ge 1$, with probability greater than
$1 - 8\exp(-x)$,

(1.3) JKff") < (1 + *) inf R(f) + n ^°f"/V) . 

jet t n 

The residual term $\epsilon_*^2$ obtained in [24] is optimal, but since it heavily
depends on the parameter $\beta$, it ranges between $\sqrt{V/n}$ and $V/n$ (up to
a logarithmic factor). In particular, it can be as bad as the square root of the
residual term of the nonexact oracle inequality (1.3) in the same situation.
The main difference between the two results is that the condition $\mathbb{E}\ell_f^2 \le \mathbb{E}\ell_f$
for every $f \in F$ is always satisfied whereas the condition that for every $f \in F$,
$\mathbb{E}\mathcal{L}_f^2 \le B(\mathbb{E}\mathcal{L}_f)^\beta$ depends on the relative position of $Y$ and $F$, and thus on
the geometry of the system $(F, Y)$.

It is interesting to note that the residual term in (1.3) always yields a fast
rate, even for hard classification problems such as $P[Y = 1|X] = 1/2$. This
means that while the prediction problem in classification is completely blind
to the geometry of the model, the estimation problem is influenced in a very
strong way by the geometry of $(F, Y)$. Thus, estimating the regression function
(or the Bayes rule) is in general much harder than predicting the output $Y$.

Another related result is the one in [12] where (among other results) an
exact oracle inequality is proved for the ERM with a residual term $\delta_n(x)$. The
residual term is controlled using the empirical oscillation
$\phi_n(\delta) = \mathbb{E}\sup_{f,g\in F(\delta)}|(P - P_n)(\ell_f - \ell_g)|$ indexed by $F(\delta) = \{f \in F : P\mathcal{L}_f \le \delta\}$,
and by the $L_2$ diameter $D(\delta) = \sup_{f,g\in F(\delta)}\sqrt{P(\ell_f - \ell_g)^2}$:

$\delta_n(x) = \operatorname{argmin}\{\delta > 0 : \phi_n(\delta) + \cdots\}.$

Note that all the quantities $\lambda^*_\epsilon$, $\epsilon_*^2$ from [24], $\delta_n(x)$ from [12], $\mu^*$ from [4]
or Theorem 6.1 below, define the residual terms of the oracle inequalities as
a fixed point of some equation. Those appear naturally either from iterative
localization of the excess risk, converging to $\delta_n(x)$ [12, 16], or from an
"isomorphic" argument [4] identifying the "level" $\mu^*$ at which the actual and
the empirical structures are equivalent. We refer the reader to those articles
for more details.

Results in [4, 12, 24] were obtained under the boundedness assumption
$\sup_{f\in F}\|\ell_f\|_\infty \le 1$ because the necessary tools from empirical processes theory,
like contraction inequalities [21], only hold under such an assumption.
In particular, these results do not apply even to the Gaussian regression
model. The approach developed in this work provides a slight improvement,
since risk bounds hold if the envelope function $\sup_{f\in F}\ell_f$ is sub-exponential
(which is the case for the Gaussian regression model with respect to the
square loss).

One should also mention the subtle but significant gap between the margin
assumption and the Bernstein condition which we use. Both state that for
every $f \in F$,

$\mathbb{E}(\ell_f - \ell_{f^*})^2 \le B\,(\mathbb{E}(\ell_f - \ell_{f^*}))^{1/\kappa}$

for some constant $\kappa \ge 1$. However, in the margin condition $f^*$ has the minimal
risk over all measurable functions (for instance, $f^*$ is the regression
function in the regression model with respect to the quadratic loss), while
in a Bernstein condition $f^*$ is replaced by the oracle $f^*_F$, which is assumed to
minimize the risk over $F$.

The two conditions are equivalent only when $f^* \in F$ (and thus $f^* = f^*_F$).
But in general, they are very different. As a simple example, in the bounded
regression model [i.e., $|Y|, \sup_{f\in F}|f(X)| \le C$] with respect to the quadratic
loss, the margin assumption holds with $\kappa = 1$ whereas the Bernstein condition
is not true in general. For more details on the difference between the
margin assumption and the Bernstein condition we refer the reader to the
discussion in [17].

1.2. Regularized empirical risk minimization. The second type of application
we will present deals with nonexact regularized oracle inequalities.
Usually a model $F$ is chosen or constructed according to the belief that an
oracle $f^*_F$ in $F$ is close, in some sense, to some minimizer $f^*$ of the risk
function in some larger class of functions $\mathcal{F}$ [e.g., in the regression model, $f^*$
can be the regression function and $\mathcal{F} = L_2(P_X)$]. Hence, choosing a particular
model $F \subset \mathcal{F}$ implicitly means that we believe $f^*$ to be close to $F$
in some sense.

It is not always possible to construct a class $F$ that captures properties $f^*$
is believed to have (e.g., a low-dimensional structure or some smoothness
properties). In such situations, one is not given a single model $F$ (usually
the set $\mathcal{F}$ is too large to be called a model), but a functional $\operatorname{crit} : \mathcal{F} \to \mathbb{R}_+$,
called a criterion, that characterizes each function according to its level of
compliance with the desired property: the smaller the criterion, the
"closer" one is to the property. For instance, when $\mathcal{F}$ is an RKHS, one can
take $\operatorname{crit}(\cdot)$ to be the norm in the reproducing kernel Hilbert space, or when $\mathcal{F}$
is the set of all linear functionals in $\mathbb{R}^d$, one may choose $\operatorname{crit}(\beta) = \|\beta\|_{\ell_p}$ for
some $p \in [0, \infty]$. The extreme case here is $p = 0$, where $\|\beta\|_{\ell_0}$ is the cardinality
of the support of $\beta$; thus a small criterion means that $\beta$ belongs to a
low-dimensional space.

Instead of considering the ERM over the too large class $\mathcal{F}$, the goal is to
construct a procedure having both good empirical performance and a small
criterion. One idea, that we will not develop here, is to minimize the empirical
risk over the set $F_r = \{f \in \mathcal{F} : \operatorname{crit}(f) \le r\}$ [5, 40], and try to find
a data-dependent way of choosing the radius $r$. Another popular idea is
to regularize the empirical risk: consider a nondecreasing function of the
criterion called a regularizing function and denoted by $\operatorname{reg} : \mathcal{F} \to \mathbb{R}_+$ and
construct

(1.4) $\hat f_n^{RERM} \in \operatorname{Argmin}_{f\in\mathcal{F}}\big(R_n(f) + \operatorname{reg}(f)\big)$

with the obvious extension if the infimum is not attained.

The procedure (1.4) is called regularized empirical risk minimization pro- 
cedure (RERM). RERM procedures were introduced to avoid the "over- 
fitting" effect of large models [3, 23], and later used to select functions with 
additional properties, like smoothness (e.g., SVM estimators in [37]) or an 
underlying low-dimensional structure (e.g., the LASSO estimator). 

In this setup, we are interested in constructing estimators $\hat f_n$ realizing the
best possible trade-off between the risk and the regularizing function over $\mathcal{F}$:
there exists some $\epsilon \ge 0$ such that with high probability

(1.5) $R(\hat f_n) + \operatorname{reg}(\hat f_n) \le (1 + \epsilon)\inf_{f\in\mathcal{F}}\big(R(f) + \operatorname{reg}(f)\big).$

Using the same terminology as in (1.1), inequality (1.5) is called a regularized
oracle inequality. When $\epsilon = 0$, (1.5) is called an exact regularized oracle
inequality, and when $\epsilon > 0$, (1.5) is called a nonexact regularized oracle
inequality.
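
The RERM procedure (1.4) can be sketched on a toy problem where the minimization runs over a finite list of candidates. Everything concrete below is an assumption made for illustration: the candidate list, the three data points, the quadratic loss, and the $\ell_1$-type regularizing function with weight 0.1.

```python
def rerm(candidates, data, loss, reg):
    """Minimize the regularized empirical risk R_n(f) + reg(f) over a finite list."""
    def objective(f):
        emp_risk = sum(loss(f, z) for z in data) / len(data)
        return emp_risk + reg(f)
    return min(candidates, key=objective)

def quadratic_loss(beta, z):
    """Loss of the linear predictor x -> <x, beta> at the point z = (x, y)."""
    x, y = z
    return (y - sum(b * xi for b, xi in zip(beta, x))) ** 2

def l1_reg(beta, weight=0.1):
    """Hypothetical regularizing function: weight * ||beta||_1."""
    return weight * sum(abs(b) for b in beta)

data = [((1.0, 0.0), 1.0), ((0.0, 1.0), 0.0), ((1.0, 1.0), 1.0)]
candidates = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.05), (0.5, 0.5)]

beta_hat = rerm(candidates, data, quadratic_loss, l1_reg)
print(beta_hat)
```

Here the sparse candidate fits the data exactly and pays the smallest penalty among the well-fitting candidates, which is the trade-off between risk and criterion that (1.5) quantifies.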

Following our analysis of the ERM algorithm, the next result is a regular- 
ized oracle inequality for the RERM. But before stating this result, one has 
to say a word on the way the regularizing function reg(-) and the criterion 
crit(-) are related. 

The choice of $\operatorname{reg}(\cdot)$ is driven by the complexity of the sequence $(F_r)_{r\ge0}$
of models

$F_r = \{f \in \mathcal{F} : \operatorname{crit}(f) \le r\}.$

For any $r \ge 0$, the complexity of $F_r$ is measured by $\lambda^*_\epsilon(r)$ defined as above
for some fixed $0 < \epsilon < 1/2$ by

$\mathbb{E}\|P_n - P\|_{V(\ell_{F_r})_{\lambda^*_\epsilon(r)}} \le (\epsilon/4)\lambda^*_\epsilon(r).$

Hence, $\lambda^*_\epsilon(r)$ is a "level" in $\ell_{F_r}$ above which the empirical and the actual
structures are equivalent; namely, with high probability, on the set
$\{\ell \in \ell_{F_r} : P\ell \ge \lambda^*_\epsilon(r)\}$,

$(1/2)P_n\ell \le P\ell \le (3/2)P_n\ell.$

Thus, the function $r \mapsto \lambda^*_\epsilon(r)$ captures the "isomorphic profile" of the collection
$(\ell_{F_r})_{r\ge0}$. Up to minor technical adjustments, the regularizing function,
defined formally in (1.8), is $\operatorname{reg}(\cdot) = \lambda^*_\epsilon(\operatorname{crit}(\cdot))$.

We will study two separate situations, both motivated by the applications
we have in mind. In the first, $\operatorname{crit}(\cdot)$ will be uniformly bounded and may
only grow with the sample size $n$; that is, there is a constant $C_n$ satisfying
that for every $f \in \mathcal{F}$, $\operatorname{crit}(f) \le C_n$. The second case we deal with is when
the "isomorphic profile" $r \mapsto \lambda^*_\epsilon(r)$ tends to infinity with $r$. For technical
reasons, we also introduce an auxiliary function $\alpha_n$, defined in the following
assumption.

Assumption 1.1. Assume that for every $f \in \mathcal{F}$, $\ell_f(Z) \ge 0$ a.s. and that
there are nondecreasing functions $\phi_n$ and $B_n$ such that for every $r \ge 0$ and
every $f \in F_r$,

$b_n(\ell_{F_r}) \le \phi_n(r)$ and $P\ell_f^2 \le B_n(r)P\ell_f + B_n^2(r)/n.$

Let $0 < \epsilon < 1/2$ and consider a function $\rho_n : \mathbb{R}_+ \times \mathbb{R}_+^* \to \mathbb{R}$ nondecreasing in
its first argument and such that, for any $r \ge 0$ and $x > 0$,

[...]

Assume that either:

• there exists $C_n > 0$ such that for every $f \in \mathcal{F}$, $\operatorname{crit}(f) \le C_n$ and in this case
define $\alpha_n(\epsilon, x) = C_n$, for all $0 < \epsilon < 1/2$ and $x > 0$, or

• the function $r \mapsto \lambda^*_\epsilon(r)$ tends to infinity with $r$ and there exists $K_1 > 0$
such that $2\rho_n(r,x) \le \rho_n(K_1(r+1),x)$, for all $r \ge 0$ and $x > 0$ and, in this
case, let $f_0$ be any function in $\bigcup_{r\ge0}F_r$ and define $\alpha_n$ such that, for every
$x > 0$ and $0 < \epsilon < 1/2$,

(1.6) $\alpha_n(\epsilon,x) \ge \max\Big[K_1(\operatorname{crit}(f_0) + 2),\ (\lambda^*_\epsilon)^{-1}\Big((1+2\epsilon)\Big(3R(f_0) + 2K'\big(b_n(\ell_{f_0}) + B_n(\operatorname{crit}(f_0))\big)\frac{x+1}{n}\Big)\Big)\Big],$

where $(\lambda^*_\epsilon)^{-1}$ is the generalized inverse function of $\lambda^*_\epsilon$ [i.e., $(\lambda^*_\epsilon)^{-1}(y) =
\sup(r > 0 : \lambda^*_\epsilon(r) \le y)$, for all $y > 0$] and $K'$ is some absolute constant.
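
The generalized inverse used in (1.6) is easy to approximate numerically when the profile is nondecreasing. The sketch below is an illustration only: the function name, the search bound `r_max`, the tolerance, and the quadratic test profile are assumptions made here and do not come from the paper.

```python
def generalized_inverse(profile, y, r_max=1e6, tol=1e-9):
    """Approximate sup{r > 0 : profile(r) <= y} for a nondecreasing profile.

    Returns r_max if the supremum exceeds r_max, and 0.0 if profile(r) > y
    for every r tested (the set is then treated as empty).
    """
    if profile(r_max) <= y:
        return r_max
    lo, hi = 0.0, r_max               # invariant: profile(hi) > y
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if profile(mid) <= y:
            lo = mid                  # mid belongs to the set, move up
        else:
            hi = mid                  # mid is too large, move down
    return lo

# Hypothetical isomorphic profile r -> r^2: the inverse at y = 4 is r = 2.
print(generalized_inverse(lambda r: r * r, 4.0))
```

Bisection is valid here because the set $\{r : \lambda^*_\epsilon(r) \le y\}$ is an interval starting at zero whenever the profile is nondecreasing.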

Theorem B. There exist absolute positive constants $c_0$, $c_1$, $K$ and $K'$
for which the following holds. Under Assumption 1.1, for every $x > 0$ and

(1.7) $\hat f_n^{RERM} \in \operatorname{Argmin}_{f\in\mathcal{F}}\Big(R_n(f) + \frac{2}{1+2\epsilon}\rho_n\big(\operatorname{crit}(f) + 1,\ x + \log\alpha_n(\epsilon,x)\big)\Big),$

with probability greater than $1 - 12\exp(-x)$,

$R(\hat f_n^{RERM}) + \rho_n\big(\operatorname{crit}(\hat f_n^{RERM}) + 1,\ x + \log\alpha_n(\epsilon,x)\big)$
$\le \inf_{f\in\mathcal{F}}\Big((1+3\epsilon)R(f) + 2\rho_n\big(\operatorname{crit}(f) + 1,\ x + \log\alpha_n(\epsilon,x)\big) + c_1\frac{(b_n(\ell_f) + B_n(\operatorname{crit}(f))/\epsilon)(x+1)}{n\epsilon}\Big).$

Fortunately, $\alpha_n$ usually has little impact on the resulting rates. For instance,
in the main application we will present here, $\log\alpha_n(\epsilon,x) \lesssim \log(x + n)$.

Like in Theorem A, the Bernstein-type condition $P\ell^2 \le B_n(r)P\ell + B_n^2(r)/n$
holds when $\ell$ is nonnegative and sub-exponential for $B_n(r) \lesssim \operatorname{diam}(\ell_{F_r}, \psi_1)\log(n)$.
Therefore, and contrary to the situation in exact oracle inequalities,
the "geometry" of the family of classes $(F_r)_{r\ge0}$ does not play a crucial role
in the resulting nonexact regularized oracle inequalities.

Observe that the choice of the regularizing function in terms of the
criterion is now made explicit:

(1.8) $\operatorname{reg}(f) = \frac{2}{1+2\epsilon}\rho_n\big(\operatorname{crit}(f) + 1,\ x + \log\alpha_n(\epsilon,x)\big).$

1.3. $\ell_1$-regularization. The formulation of Theorem B seems cumbersome,
but it is not very difficult to apply it, and here we will present one
application dealing with high-dimensional vectors of short support. Other
applications on matrix completion, convex aggregation and model selection
can be found in [20].

Formally, let $(X,Y), (X_i,Y_i)_{1\le i\le n}$ be $n+1$ i.i.d. random variables with
values in $\mathbb{R}^d \times \mathbb{R}$, and denote by $P_X$ the marginal distribution of $X$. The
dimension $d$ can be much larger than $n$ but we believe that the output $Y$
can be well predicted by a sparse linear combination of covariables of $X$; in
other words, $Y$ can be reasonably approximated by $\langle X,\beta_0\rangle$ for some $\beta_0 \in \mathbb{R}^d$
of short support (even though we will not require any assumption of this
type to obtain our results).

These kinds of problems are called "high-dimensional" because there are
more covariables than observations. Nevertheless, one hopes that under the
structural assumption that $Y$ "depends" only on a small number of covariables
of $X$, it would still be possible to construct efficient statistical procedures
to predict $Y$.

In this framework, a natural criterion function is the $\ell_0$ function measuring
the size of the support of a vector. But since this function is far from being
convex, using it in practice is hard; see, for example, [35]. Therefore, it is
natural to consider a convex relaxation of the $\ell_0$ function as a criterion:
the $\ell_1$ norm [8, 10, 40].

In what follows, we will apply Theorem B to establish nonexact regularized
oracle inequalities for $\ell_1$-based RERM procedures, and with fast error
rates: a residual term that tends to 0 like $1/n$ up to logarithmic terms. The
regularizing function resulting from Theorem B for the $L_q$-loss ($q \ge 2$) will
be the $q$th power of the $\ell_1$-norm. In particular, for the quadratic loss, we
regularize by $\|\cdot\|_{\ell_1}^2$, the square of the $\ell_1$-norm,



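(1.9) $\hat\beta_n \in \operatorname{Argmin}_{\beta\in\mathbb{R}^d}\Big(\frac{1}{n}\sum_{i=1}^n (Y_i - \langle X_i,\beta\rangle)^2 + \kappa(n,d,x)\frac{\|\beta\|_{\ell_1}^2}{n}\Big),$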




while the standard LASSO is regularized by the $\ell_1$ norm itself. This choice
of the exponent is dictated by the complexity of the underlying models: the
sequence of balls $(rB_1^d)_{r\ge0}$ through the isomorphic profile function $r \mapsto \lambda^*_\epsilon(r)$.
Observe that since $\|\beta\|_{\ell_1}/\sqrt{n} \ge \|\beta\|_{\ell_1}^2/n$ when $\|\beta\|_{\ell_1} \le \sqrt{n}$, a nonexact oracle
inequality for the LASSO estimator itself follows from Theorem B, but with
a slow rate of $1/\sqrt{n}$. Using the $q$th power of the $\ell_1$-norm as a penalty function
for the $L_q$-risk yields a fast $1/n$ rate (see Theorem C).
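
The comparison between the two penalties can be checked numerically. The sketch below strips all tuning constants and logarithmic factors, so the two functions are only the bare scalings $\|\beta\|_{\ell_1}/\sqrt{n}$ and $\|\beta\|_{\ell_1}^2/n$; the function names and the example vectors are assumptions made for illustration.

```python
import math

def l1_norm(beta):
    return sum(abs(b) for b in beta)

def lasso_penalty(beta, n):
    """l1 penalty at the slow scale: ||beta||_1 / sqrt(n)."""
    return l1_norm(beta) / math.sqrt(n)

def squared_l1_penalty(beta, n):
    """Squared-l1 penalty at the fast scale: ||beta||_1^2 / n."""
    return l1_norm(beta) ** 2 / n

n = 400                      # sqrt(n) = 20
small = [3.0, -4.0, 0.5]     # ||small||_1 = 7.5, below sqrt(n)
large = [30.0, -15.0]        # ||large||_1 = 45, above sqrt(n)

for beta in (small, large):
    print(l1_norm(beta), lasso_penalty(beta, n), squared_l1_penalty(beta, n))
```

The l1 penalty dominates the squared one exactly on the ball of radius $\sqrt{n}$, which is the regime in which the squared penalty leaves room for a residual term of order $1/n$.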

We will perform this study for the $L_q$-loss function, in which case,
for every $\beta \in \mathbb{R}^d$,

$R^{(q)}(\beta) = \mathbb{E}|Y - \langle X,\beta\rangle|^q$ and $R_n^{(q)}(\beta) = \frac{1}{n}\sum_{i=1}^n |Y_i - \langle X_i,\beta\rangle|^q.$

The following result is obtained only under the assumption that $Y$ and
$\|X\|_{\ell_\infty^d}$ belong to $L_{\psi_q}$. Since there are no "statistically reasonable" $\psi_q$
variables for $q > 2$, it sounds more "statistically relevant" to assume that $|Y|$,
$\|X\|_{\ell_\infty^d}$ are almost surely bounded when one wants results for the $L_q$-risk
with $q > 2$, or that the functions are in $L_{\psi_2}$ for $q = 2$ (e.g., linear models
with sub-Gaussian noise and a sub-Gaussian design satisfy this condition).

Theorem C. Let $q \ge 2$. There exist constants $c_0$ and $c_1$ that depend
only on $q$ for which the following holds. Assume that there exists $K(d) > 0$
such that $\|Y\|_{\psi_q}, \big\|\|X\|_{\ell_\infty^d}\big\|_{\psi_q} \le K(d)$. For $x > 0$ and $0 < \epsilon < 1/2$, let

$\lambda(n,d,x) = c_0 K(d)^q (\log n)^{(4q-2)/q} (\log d)^2 (x + \log n)$

and consider the RERM estimator

$\hat\beta_n \in \operatorname{Argmin}_{\beta\in\mathbb{R}^d}\Big(R_n^{(q)}(\beta) + \lambda(n,d,x)\frac{\|\beta\|_{\ell_1}^q}{n\epsilon^2}\Big).$

Then, with probability greater than $1 - 12\exp(-x)$, the $L_q$-risk of $\hat\beta_n$ satisfies

$R^{(q)}(\hat\beta_n) \le \inf_{\beta\in\mathbb{R}^d}\Big((1+2\epsilon)R^{(q)}(\beta) + \eta(n,d,x)\frac{1 + \|\beta\|_{\ell_1}^q}{n\epsilon^2}\Big),$

where $\eta(n,d,x) = c_1 K(d)^q (\log n)^{(4q-2)/q} (\log d)^2 (x + \log n)$.
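
The penalized criterion of Theorem C can be written down directly. The sketch below evaluates the objective only and does not attempt to minimize it; the constant `c0 = 1.0` is a placeholder for the unspecified constant $c_0$ of the theorem, and the function names and the two-point data set are assumptions made for illustration.

```python
import math

def empirical_lq_risk(beta, data, q):
    """R_n^{(q)}(beta) = (1/n) * sum_i |y_i - <x_i, beta>|^q."""
    return sum(abs(y - sum(b * xi for b, xi in zip(beta, x))) ** q
               for x, y in data) / len(data)

def tuning_parameter(n, d, x, q, K_d, c0=1.0):
    """lambda(n, d, x) as in Theorem C; c0 = 1.0 is a placeholder constant."""
    return (c0 * K_d ** q * math.log(n) ** ((4 * q - 2) / q)
            * math.log(d) ** 2 * (x + math.log(n)))

def rerm_objective(beta, data, q, d, x, eps, K_d, c0=1.0):
    """R_n^{(q)}(beta) + lambda(n, d, x) * ||beta||_1^q / (n * eps^2)."""
    n = len(data)
    lam = tuning_parameter(n, d, x, q, K_d, c0)
    l1 = sum(abs(b) for b in beta)
    return empirical_lq_risk(beta, data, q) + lam * l1 ** q / (n * eps ** 2)

data = [((1.0, 0.0), 2.0), ((0.0, 1.0), 1.0)]
print(empirical_lq_risk((0.0, 0.0), data, q=2))
print(rerm_objective((1.0, 0.0), data, q=2, d=2, x=1.0, eps=0.25, K_d=1.0))
```

Since the penalty is the $q$th power of a norm, the objective is convex in $\beta$ for $q \ge 1$, so in practice a generic convex solver could be applied to it; that step is outside the scope of this sketch.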

Procedures based on the $\ell_1$-norm as a regularizing or constraint function
have been studied extensively in the last few years. We only mention
a small fraction of this very extensive body of work [6-8, 13, 15, 22, 25,
26, 40, 41, 45, 46]. In fact, it is almost impossible to make a proper comparison
even with the results mentioned in this partial list. Some of these
results are close enough in nature to Theorem C to allow a comparison. In
particular, in [4], the authors prove that with high probability, the LASSO
satisfies an exact oracle inequality with a residual term $\sim \|\beta\|_{\ell_1}/\sqrt{n}$ up to
logarithmic factors, under tail assumptions on $Y$ and $X$. In [7], upper bounds
on the risks $\mathbb{E}[\langle X,\hat\beta_n - \beta_0\rangle^2]$ and $\|\hat\beta_n - \beta_0\|_{\ell_1}$ were obtained for a weighted
LASSO $\hat\beta_n$ when $\mathbb{E}(Y|X) = \langle X,\beta_0\rangle$ for $\beta_0$ with short support. Exact oracle
inequalities for RERM using an entropy-based criterion or an $\ell_p$ criterion
(with $p$ close to 1) were obtained in [14, 15] for any convex and regular
loss function and with fast rates. Similar bounds were obtained in [41] for
a RERM using a weighted $\ell_1$-criterion. In [6] it is shown that the LASSO
and Dantzig estimators [8] satisfy oracle inequalities in the deterministic design
setup and under the REC condition. In fact, in most of these results the
authors obtained exact oracle inequalities with an optimal residual term of
$|\operatorname{Supp}(\beta_0)|(\log d)/n$, which is clearly better than the rate $\|\beta_0\|_{\ell_1}^2/n$ obtained
in Theorem C for the quadratic loss and in the same context.

However, it is important to note that all these exact oracle inequalities
were obtained under an assumption that is similar in nature to the Restricted
Isometry Property (RIP), whereas in Theorem C one does not need that kind
of assumption on the design. Although it seems strange that it is possible
to obtain fast rates without RIP, there is nothing magical here. In fact, the
isomorphic argument used to prove Theorem B (and thus Theorem C) shows
that the random operator $\beta \in \mathbb{R}^d \mapsto n^{-1/2}\sum_{i=1}^n (Y_i - \langle X_i,\beta\rangle)e_i \in \mathbb{R}^n$ satisfies
some sort of an RIP, which actually coincides with the RIP in the
noise-free case $Y = \langle X,\beta_0\rangle$ for an isotropic design. This indicates that RIP
is not the key property in establishing oracle inequalities for the prediction
risk, but rather, the "isomorphic profile" of the problem at hand, which
takes into account the structure of the class of functions.

Finally, a word about notation. Throughout, we denote absolute constants
or constants that depend on other parameters by $c$, $C$, $c_1$, $C_2$, etc.
(and, of course, we will specify when a constant is absolute and when it
depends on other parameters). The values of these constants may change
from line to line. The notation $x \sim y$ (resp., $x \lesssim y$) means that there exist
absolute constants $0 < c < C$ such that $cy \le x \le Cy$ (resp., $x \le Cy$). If $b > 0$
is a parameter, then $x \lesssim_b y$ means that $x \le C(b)y$ for some constant $C(b)$
depending only on $b$. We denote by $\ell_p^d$ the space $\mathbb{R}^d$ endowed with the $\ell_p$
norm $\|x\|_{\ell_p^d} = (\sum_j |x_j|^p)^{1/p}$. The unit ball there is denoted by $B_p^d$ and the
unit Euclidean sphere in $\mathbb{R}^d$ is $S^{d-1}$.

2. Preliminaries to the proofs. In this section we obtain a general bound
on $\mathbb{E}\|P - P_n\|_{V(\ell_F)_\lambda}$ for the $L_q$-loss when $q \ge 2$, and show that a Bernstein-type
condition is satisfied under weak assumptions on the loss function.

2.1. Isomorphic properties of the loss class. The isomorphic property of
a function class measures the "level" at which empirical means and actual
means are equivalent. The notion was introduced in this context in [4].
Although it is not a necessary feature of this method, if one wishes the isomorphic
property to hold with exponential probability, one can use a high probability
deviation bound on the supremum of the localized process. A standard
way (though not the only way, or even the optimal way!) of obtaining such
a result is through Talagrand's concentration inequality [38] applied to
localizations of the function class, combined with a good control of the variance
in terms of the expectation (a Bernstein-type condition). When applied
to an excess loss class, this argument leads to exact oracle inequalities; see,
for example, [5, 32]. Here we are interested in nonexact oracle inequalities,
and thus, we will study the isomorphic properties of the loss class. To make
the presentation simpler, we are not dealing with a fully "unbounded theory"
like in [27], but rather assume that the class has an envelope function which
is bounded in $\psi_1$, and we follow the path of [32], in which one obtains
the desired high probability bounds using Talagrand's concentration theorem.
Since we would like to avoid the assumption that the class consists
of uniformly bounded functions, an important part of our analysis is the
following $\psi_1$ version of Talagrand's inequality [1].

Theorem 2.1. There exists an absolute constant $K > 0$ for which the
following holds. Let $Z_1,\dots,Z_n$ be $n$ i.i.d. random variables with values in
a space $\mathcal{Z}$, and let $G$ be a countable class of real-valued measurable functions
defined on $\mathcal{Z}$. For every $x > 0$ and $\alpha > 0$, with probability greater than
$1 - 4\exp(-x)$,

$\|P - P_n\|_G \le (1+\alpha)\mathbb{E}\|P - P_n\|_G + K\sigma(G)\sqrt{\frac{x}{n}} + K(1+\alpha^{-1})b_n(G)\frac{x}{n}.$

Using the same truncation argument as in [1], it follows that for every
single function $g \in L_2(P)$ and every $\alpha, x > 0$, with probability greater than
$1 - 4\exp(-x)$,

$P_n g \le (1+\alpha)Pg + K\sqrt{\frac{(Pg^2)\,x}{n}} + K(1+\alpha^{-1})\frac{b_n(g)\,x}{n}$

and, in particular, if there exists some $B_n \ge 0$ for which $Pg^2 \le B_n Pg +
B_n^2/n$, then for every $0 < \alpha \le 1$ and $x > 0$, with probability greater than
$1 - 4\exp(-x)$,

(2.1) $P_n g \le (1+2\alpha)Pg + K'(1+\alpha^{-1})\big(b_n(g) + B_n\big)\frac{x+1}{n}.$
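
One way to pass from the single-function bound to an inequality of the form (2.1) is sketched below. This is a hedged reconstruction rather than the paper's own argument: it additionally assumes $Pg \ge 0$ (as is the case for a nonnegative loss), and the constant $K' = K + K^2/4$ is just one admissible choice.

```latex
% Start from P_n g \le (1+\alpha)Pg + K\sqrt{Pg^2\,x/n} + K(1+\alpha^{-1})\,b_n(g)\,x/n.
% Use Pg^2 \le B_n Pg + B_n^2/n together with \sqrt{a+b}\le\sqrt{a}+\sqrt{b}:
\[
K\sqrt{\frac{Pg^2\,x}{n}} \le K\sqrt{\frac{B_n\,Pg\,x}{n}} + \frac{K B_n\sqrt{x}}{n}.
\]
% AM-GM in the form \sqrt{uv}\le \alpha u + v/(4\alpha), with u = Pg and v = K^2 B_n x/n,
% and the elementary bound \sqrt{x} \le x+1:
\[
K\sqrt{\frac{B_n\,Pg\,x}{n}} \le \alpha\,Pg + \frac{K^2 B_n x}{4\alpha n},
\qquad
\frac{K B_n\sqrt{x}}{n} \le \frac{K B_n (x+1)}{n}.
\]
% Collect terms, using 1/\alpha \le 1+\alpha^{-1}, 1 \le 1+\alpha^{-1} and x \le x+1:
\[
P_n g \le (1+2\alpha)\,Pg + K'(1+\alpha^{-1})\bigl(b_n(g)+B_n\bigr)\frac{x+1}{n},
\qquad K' = K + \frac{K^2}{4}.
\]
```

The factor $x+1$ rather than $x$ comes from the term $B_n\sqrt{x}/n$, which cannot be bounded by a multiple of $B_n x/n$ uniformly over small $x > 0$.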

Theorem 2.1 can be extended to classes $G$ satisfying some separability
property like condition (M) in [24]. We apply Theorem 2.1 in this context
and it will be implicitly assumed that every time we use Theorem 2.1, this
separability condition holds. In particular, Theorem 2.1 will be applied to
the localized sets $V(\ell_F)_\lambda$ to get nonexact oracle inequalities for the ERM
algorithm and to the family $(V(\ell_{F_r})_\lambda)_{r\ge0}$ to get nonexact regularized oracle
inequalities for the RERM procedure.

Observe that Theorem 2.1 requires that the envelope function $\sup_{g\in G}|g|$
is sub-exponential, but since $\|\max_{1\le i\le n}X_i\|_{\psi_1} \lesssim \|X\|_{\psi_1}\log n$ it follows that
$b_n(G)$ is not much larger than $\|\sup_{g\in G}|g(X)|\|_{\psi_1}$. However, this condition
can be a major drawback. For instance, if the set $G$ consists of linear functions
indexed by the Euclidean sphere $S^{d-1}$ and $X$ is the standard Gaussian
measure on $\mathbb{R}^d$, the resulting envelope function is bounded in $\psi_1$, but its
norm is of the order of $\sqrt{d}$. In Theorem C, we bypass this obstacle by assuming
that $\|Y\|_{\psi_q}, \big\|\|X\|_{\ell_\infty^d}\big\|_{\psi_q} \le K(d)$. This assumption is far better suited for
situations in which the indexing class is small, like localized subsets of $B_1^d$
that appear naturally in LASSO type results.



Theorem 2.2. Let $F$ be a class of functions and assume that there exists
$B_n > 0$ such that for every $f \in F$, $P\ell_f^2 \le B_n P\ell_f + B_n^2/n$. If $0 < \varepsilon < 1/2$ and
$\lambda_\varepsilon^* > 0$ satisfy that

\[ \mathbb{E}\|P_n - P\|_{V(\ell_F)_{\lambda_\varepsilon^*}} \le (\varepsilon/4)\lambda_\varepsilon^*, \]

then for every $x > 0$, with probability larger than $1 - 4e^{-x}$, for every $f \in F$,

\[ P\ell_f \le (1+2\varepsilon)P_n\ell_f + \rho_n(x), \]

where, for $K$ the constant appearing in Theorem 2.1,

\[ \rho_n(x) = \max\Big(\lambda_\varepsilon^*,\ \frac{\big(4Kb_n(\ell_F) + (6K)^2B_n/\varepsilon\big)(x+1)}{n\varepsilon}\Big). \]



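Proof. The proof follows the ideas from [4]. Fix $\lambda > 0$ and $x > 0$, and
note that by Theorem 2.1, with probability larger than $1 - 4\exp(-x)$,

\[ \|P - P_n\|_{V(\ell_F)_\lambda} \le 2\,\mathbb{E}\|P - P_n\|_{V(\ell_F)_\lambda} + K\sigma(V(\ell_F)_\lambda)\sqrt{\frac{x}{n}} + Kb_n(V(\ell_F)_\lambda)\frac{x}{n}. \tag{2.2} \]

Clearly, we have $b_n(V(\ell_F)_\lambda) \le b_n(\ell_F)$ and

\[ \sigma^2(V(\ell_F)_\lambda) = \sup\big(P(\alpha\ell_f)^2 : 0 \le \alpha \le 1,\ f \in F,\ P(\alpha\ell_f) \le \lambda\big) \le B_n\lambda + B_n^2/n. \]

Moreover, since $V(\ell_F)$ is star-shaped, $\lambda > 0 \mapsto \phi(\lambda) = \mathbb{E}\|P - P_n\|_{V(\ell_F)_\lambda}/\lambda$
is nonincreasing, and since $\phi(\lambda_\varepsilon^*) \le \varepsilon/4$ and $\rho_n(x) \ge \lambda_\varepsilon^*$, then

\[ \mathbb{E}\|P - P_n\|_{V(\ell_F)_{\rho_n(x)}} \le (\varepsilon/4)\rho_n(x). \]

Combined with (2.2), there exists an event $\Omega_0(x)$ of probability greater than
$1 - 4\exp(-x)$, and on $\Omega_0(x)$,

\[ \|P - P_n\|_{V(\ell_F)_{\rho_n(x)}} \le (\varepsilon/2)\rho_n(x) + K\sqrt{\frac{(B_n\rho_n(x) + B_n^2/n)x}{n}} + K\frac{b_n(\ell_F)x}{n} \le \varepsilon\rho_n(x). \]

Hence, on $\Omega_0(x)$, if $g \in V(\ell_F)$ satisfies that $Pg \le \rho_n(x)$, then $|Pg - P_n g| \le \varepsilon\rho_n(x)$.
Moreover, if $P\ell_f = \beta > \rho_n(x)$, then $g = \rho_n(x)\ell_f/\beta \in V(\ell_F)_{\rho_n(x)}$;
hence $|Pg - P_n g| \le \varepsilon\rho_n(x)$, and so $(1-\varepsilon)P\ell_f \le P_n\ell_f \le (1+\varepsilon)P\ell_f$. □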
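To make the shape of the isomorphic function in Theorem 2.2 concrete, here is a minimal Python sketch that evaluates $\rho_n(x)$. The function name `rho_n` and its argument names are illustrative, and the default `K=1.0` is only a placeholder: the absolute constant of Theorem 2.1 is not specified in the text.

```python
def rho_n(x, n, eps, lam_star, b_n, B_n, K=1.0):
    """Evaluate rho_n(x) = max(lam_star, (4*K*b_n + (6*K)**2 * B_n/eps) * (x+1) / (n*eps)).

    K stands in for the (unspecified) absolute constant of Theorem 2.1;
    the default value 1.0 is an assumption made for illustration only.
    """
    if not 0 < eps < 0.5:
        raise ValueError("eps must lie in (0, 1/2)")
    if x <= 0 or n < 1:
        raise ValueError("x must be positive and n at least 1")
    residual = (4 * K * b_n + (6 * K) ** 2 * B_n / eps) * (x + 1) / (n * eps)
    return max(lam_star, residual)


if __name__ == "__main__":
    # The residual part decays like 1/n, so for large n the fixed point dominates.
    for n in (10**2, 10**4, 10**6):
        print(n, rho_n(x=1.0, n=n, eps=0.25, lam_star=1e-3, b_n=1.0, B_n=1.0))
```

With these illustrative inputs the returned value decreases with $n$ until it reaches the fixed point `lam_star`, which is the regime in which the complexity term governs the rate.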

2.2. The Bernstein condition of loss functions classes. In Theorem A,
the desired concentration properties (and thus the fast rates in Theorem C)
rely on a Bernstein-type condition, that for every $f \in F$,

\[ P\ell_f^2 \le B_n P\ell_f + B_n^2/n. \tag{2.3} \]

Assumption (2.3) is trivially satisfied when the loss functions are positive
and uniformly bounded: if $0 \le \ell_f \le B$, then $P\ell_f^2 \le BP\ell_f$. It also turns out
that (2.3) does not require any "global" structural assumption on $F$ and is
trivially verified if class members have sub-exponential tails.

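Lemma 2.3. Let $X$ be a nonnegative subexponential random variable.
Then for every $z \ge 1$,

\[ \mathbb{E}X^2 \le \log(ez)\|X\|_{\psi_1}\mathbb{E}X + \frac{(4 + 6\log^2(ez))\|X\|_{\psi_1}^2}{ez}. \]

Proof. Fix $\theta > 0$, and note that

\[ \mathbb{E}X^2 1_{X>\theta} = \int_0^\infty 2t\,\mathbb{P}[X1_{X>\theta} > t]\,dt = \theta^2\,\mathbb{P}[X > \theta] + 2\int_\theta^\infty t\,\mathbb{P}[X > t]\,dt \]
\[ \le 2\theta^2\exp(-\theta/\|X\|_{\psi_1}) + 4\int_\theta^\infty t\exp(-t/\|X\|_{\psi_1})\,dt \tag{2.4} \]
\[ \le \big(2\theta^2 + 4\theta\|X\|_{\psi_1} + 4\|X\|_{\psi_1}^2\big)\exp(-\theta/\|X\|_{\psi_1}). \]

Since $X \ge 0$, it follows from (2.4) that, for any $\theta > 0$,

\[ \mathbb{E}X^2 \le \mathbb{E}X^2 1_{X \le \theta} + \mathbb{E}X^2 1_{X>\theta} \le \theta\,\mathbb{E}X + \big(2\theta^2 + 4\theta\|X\|_{\psi_1} + 4\|X\|_{\psi_1}^2\big)\exp(-\theta/\|X\|_{\psi_1}). \]

The result follows for $\theta = \|X\|_{\psi_1}\log(ez)$. □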
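As a sanity check of Lemma 2.3, the following Python sketch compares both sides of the inequality for a scaled exponential variable, where all quantities are available in closed form. It assumes the usual definition $\|X\|_{\psi_1} = \inf\{c > 0 : \mathbb{E}\exp(|X|/c) \le 2\}$, which is not restated in this section; under that assumption, $X = mE$ with $E \sim \mathrm{Exp}(1)$ has $\mathbb{E}X = m$, $\mathbb{E}X^2 = 2m^2$ and $\|X\|_{\psi_1} = 2m$. The function names are illustrative.

```python
import math


def lemma_2_3_rhs(mean, psi1_norm, z):
    """Right-hand side of Lemma 2.3 for a nonnegative variable with the given mean and psi_1 norm."""
    if z < 1:
        raise ValueError("z must be at least 1")
    log_ez = math.log(math.e * z)
    return log_ez * psi1_norm * mean + (4 + 6 * log_ez ** 2) * psi1_norm ** 2 / (math.e * z)


def scaled_exponential(m):
    """Return (EX, EX^2, psi_1 norm) for X = m * E with E ~ Exp(1).

    E exp(X/c) = c / (c - m) for c > m, which is at most 2 exactly when c >= 2m,
    so the psi_1 norm is 2m under the Orlicz-norm definition assumed above.
    """
    return m, 2 * m * m, 2 * m


if __name__ == "__main__":
    for m in (0.5, 1.0, 3.0):
        mean, second_moment, psi1 = scaled_exponential(m)
        for z in (1, 10, 1000):
            print(m, z, second_moment, lemma_2_3_rhs(mean, psi1, z))
```

In this example the second moment never exceeds the right-hand side, and the second term of the bound becomes negligible as $z$ grows, which is the regime used with $z = n$ below.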

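In particular, if $\ell_f \ge 0$ and $\|\ell_f\|_{\psi_1} \le D$ for some $D \ge 1$, then for every
$n \ge 1$,

\[ \mathbb{E}\ell_f^2 \le \big(c_0 D\log(en)\big)\,\mathbb{E}\ell_f + \frac{(c_0 D\log(en))^2}{n}. \]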

2.3. Upper bounds on $\mathbb{E}\|P - P_n\|_{V(\ell_F)_\lambda}$. Let $H$ be the loss class associated
with $F$ for the ERM or with a class $F_r$ for some $r \ge 0$ for the RERM.
The next step is to obtain bounds on the fixed point of the localized process,
that is, for some $c_0 < 1$, to find a small $\lambda^*$ for which

\[ \mathbb{E}\|P - P_n\|_{V(H)_{\lambda^*}} \le c_0\lambda^*. \]

Note that the complexity of the star-shaped hull $V(H)$ is not far from
the one of $H$ itself. Actually, a bound on the expectation of the supremum
of the empirical process indexed by $V(H)_\lambda$ will follow from one on $H_\mu$ for
different levels $\mu \in \{2^i\lambda : i \in \mathbb{N}\}$. This follows from the peeling argument
of [5]: $V(H)_\lambda \subset \bigcup_{i=0}^\infty \{\theta h : 0 \le \theta \le 2^{-i},\ h \in H,\ \mathbb{E}h \le 2^{i+1}\lambda\}$. Therefore,
setting $H_\mu = \{h \in H : \mathbb{E}h \le \mu\}$ for all $\mu > 0$ and $R^* = \inf_{h \in H}\mathbb{E}h$,

\[ \mathbb{E}\|P - P_n\|_{V(H)_\lambda} \le \sum_{\{i :\, 2^{i+1}\lambda \ge R^*\}} 2^{-i}\,\mathbb{E}\|P - P_n\|_{H_{2^{i+1}\lambda}}, \tag{2.5} \]

because if $2^{i+1}\lambda < R^*$, then the sets $H_{2^{i+1}\lambda}$ are empty. Thus, it remains to
bound $\mathbb{E}\|P - P_n\|_{H_\mu}$ for any $\mu > 0$.
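The peeling bound (2.5) can be sketched numerically as follows. This is only an illustration: `shell_bound` is a hypothetical user-supplied function standing for any upper bound on the empirical process indexed by $H_\mu$, and the infinite sum is truncated at `i_max`, which is an assumption of the sketch rather than part of the argument.

```python
import math


def peeling_bound(lam, r_star, shell_bound, i_max=60):
    """Truncated version of the sum in (2.5).

    Sums 2^{-i} * shell_bound(2^{i+1} * lam) over the indices i <= i_max
    for which 2^{i+1} * lam >= r_star; the remaining shells are empty.
    """
    total = 0.0
    for i in range(i_max + 1):
        mu = 2 ** (i + 1) * lam
        if mu < r_star:
            continue  # H_mu is empty below the level R*
        total += 2.0 ** (-i) * shell_bound(mu)
    return total


if __name__ == "__main__":
    # Hypothetical shell bound of the form sqrt(mu * U / n), with U = 1 and n = 10**4.
    bound = lambda mu: math.sqrt(mu * 1.0 / 10**4)
    print(peeling_bound(lam=1e-3, r_star=0.0, shell_bound=bound))
    print(peeling_bound(lam=1e-3, r_star=0.5, shell_bound=bound))
```

For a shell bound that grows like $\sqrt{\mu}$, the weights $2^{-i}$ make the series converge geometrically, so the peeled bound is of the same order as the bound at the single level $\lambda$.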

Let us mention that a naive attempt to control these empirical processes 
using a contraction argument is likely to fail, and will result in slow rates 
even in very simple cases (e.g., a regression model with a bounded design). 
We refer to [11, 31, 33] for more details. 

The bounds obtained below on $\mathbb{E}\|P - P_n\|_{H_\mu}$ are expressed in terms of
a random metric complexity of $H$, which is based on the structure of a typical
coordinate projection $P_\sigma H$. These random sets are defined for every sample
$\sigma = (X_1,\ldots,X_n)$ by

\[ P_\sigma H = \{(f(X_1),\ldots,f(X_n)) : f \in H\}. \]

The complexity of these random sets will be measured via a metric invariant,
called the $\gamma_2$-functional, introduced by Talagrand as a part of the
generic chaining mechanism.

Definition 2.4 ([39]). Let $(T,d)$ be a semi-metric space. An admissible
sequence of $T$ is a sequence $(T_s)_{s \in \mathbb{N}}$ of subsets of $T$ such that $|T_0| \le 1$ and
$|T_s| \le 2^{2^s}$ for any $s \ge 1$. We define

\[ \gamma_2(T,d) = \inf_{(T_s)_{s \in \mathbb{N}}}\ \sup_{t \in T}\ \sum_{s=0}^\infty 2^{s/2}\,d(t,T_s), \]

where the infimum is taken over all admissible sequences $(T_s)_{s \in \mathbb{N}}$ of $T$.
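For a finite set $T$, Definition 2.4 can be explored directly: any admissible sequence whose last set is all of $T$ gives an upper bound on $\gamma_2(T,d)$, since the sequence can be continued with $T_s = T$ at no extra cost. The sketch below computes that upper bound for a user-chosen sequence; it does not search for the optimal sequence. The names `sup_metric`, `is_admissible` and `chaining_cost` are illustrative, and the sup-norm metric is chosen only because it matches the coordinate projections used above.

```python
def sup_metric(u, v):
    """Sup-norm distance between two points given as tuples of equal length."""
    return max(abs(a - b) for a, b in zip(u, v))


def is_admissible(sequence):
    """Check 1 <= |T_0| <= 1 and 1 <= |T_s| <= 2^(2^s) for s >= 1 (nonempty sets are required here)."""
    for s, T_s in enumerate(sequence):
        cap = 1 if s == 0 else 2 ** (2 ** s)
        if not 1 <= len(T_s) <= cap:
            return False
    return True


def chaining_cost(T, sequence, dist=sup_metric):
    """sup over t in T of sum_s 2^{s/2} d(t, T_s) for a finite admissible sequence.

    The last set must contain every point of T, so that all later terms of
    the infinite sum vanish and the value is an upper bound on gamma_2(T, d).
    """
    if not is_admissible(sequence):
        raise ValueError("sequence is not admissible")
    if any(t not in sequence[-1] for t in T):
        raise ValueError("the last set of the sequence must contain T")
    return max(
        sum(2 ** (s / 2) * min(dist(t, u) for u in T_s) for s, T_s in enumerate(sequence))
        for t in T
    )


if __name__ == "__main__":
    T = [(0.0,), (1.0,), (3.0,)]
    print(chaining_cost(T, [[(1.0,)], T]))
```

Choosing the first set well matters: starting the sequence at a central point of $T$ lowers the cost, which is the finite-set analogue of optimizing over admissible sequences.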

We refer the reader to [39] for an extensive survey on chaining methods
and on the $\gamma_2$-functionals. In particular, one can bound the $\gamma_2$-functional
using an entropy integral

\[ \gamma_2(T,d) \lesssim \int_0^{\mathrm{diam}(T,d)} \sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon, \tag{2.6} \]

where $N(T,d,\varepsilon)$ is the minimal number of balls of radius $\varepsilon$ with respect to
the metric $d$ needed to cover $T$, and $\mathrm{diam}(T,d)$ is the diameter of the metric
space $(T,d)$.

We will use the $\gamma_2$-functional to state our theoretical bounds because there
are examples in which $\gamma_2(T,d)$ is significantly smaller than the corresponding
entropy integral. However, in all our concrete applications we will use the
bound (2.6) since the computation of those is much simpler, the gap is at
most logarithmic and the purpose of this note is not to obtain the optimal
estimates but to show that the residual terms in exact and nonexact oracle
inequalities could be very different.

Now, we turn to some concrete examples where $H$ is the loss functions
class in the regression model with respect to the $L_q$-loss.

Let $q \ge 2$ and set the $L_q$-loss function of $f$ to be $\ell_f^{(q)}(x,y) = |y - f(x)|^q$.
In this case, the $L_q$-loss functions class localized at some level $\mu$ is
$(\ell_F^{(q)})_\mu = \{\ell_f^{(q)} : f \in F,\ \mathbb{E}\ell_f^{(q)} \le \mu\}$.

The following result is a combination of a truncation argument and Rudelson's
$L_\infty$ method. To formulate it, set $M = \|\sup_{\ell \in (\ell_F^{(q)})_\mu}|\ell|\|_{\psi_1}$, for any set $A$
let $\tilde{A} = A \cup -A$, and if $F^{(\mu)} = \{f \in F : P\ell_f \le \mu\}$, put $U_n = \mathbb{E}\gamma_2^2(P_\sigma\widetilde{F^{(\mu)}}, \ell_\infty^n)$.

Proposition 2.5. For every $q \ge 2$, there exists a constant $c_0$ depending
only on $q$ for which the following holds. If $F$ is a class of functions, then for
any $\mu > 0$:

(1) if $q = 2$, then

\[ \mathbb{E}\|P - P_n\|_{(\ell_F^{(2)})_\mu} \le c_0\max\Big(\sqrt{\frac{\mu U_n}{n}},\ \frac{U_n}{n}\Big); \]

(2) if $q > 2$, then $\mathbb{E}\|P - P_n\|_{(\ell_F^{(q)})_\mu}$ is upper bounded by

\[ c_0\max\Big(\sqrt{\frac{\mu U_n}{n}}\,(M\log n)^{(q-2)/(2q)},\ \frac{U_n}{n}(M\log n)^{(q-2)/q},\ \frac{M\log n}{n}\Big). \]



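Proof. Let $\phi(h) = \mathrm{sign}(h)\min(|h|,\theta)$, where $\theta > 0$ is a threshold to be
fixed later. For $f \in F$, set $h_f(x,y) = y - f(x)$, let $H_\mu = \{h_f : f \in F,\ \mathbb{E}|h_f|^q \le \mu\}$,
and note that $|h|^q = |\phi(h)|^q + (|h|^q - \theta^q)1_{|h|>\theta}$. Thus,

\[ \mathbb{E}\|P - P_n\|_{(\ell_F^{(q)})_\mu} = \mathbb{E}\sup_{h \in H_\mu}|(P_n - P)(|h|^q)| \]
\[ \le \mathbb{E}\sup_{h \in H_\mu}|(P_n - P)(|\phi(h)|^q)| + \mathbb{E}\sup_{h \in H_\mu}P_n|h|^q1_{|h|>\theta} + \sup_{h \in H_\mu}P|h|^q1_{|h|>\theta} \]
\[ \le \mathbb{E}\sup_{h \in H_\mu}|(P_n - P)(|\phi(h)|^q)| + 2\,\mathbb{E}\Big(\sup_{h \in H_\mu}|h|^q1_{|h|>\theta}\Big). \]

To upper bound the truncated part of the process, consider the empirical
diameter $D_n = \sup_{h \in H_\mu}(P_n|\phi(h)|^{2q-2})^{1/(2q-2)}$. By the Giné-Zinn
symmetrization theorem [42] and the upper bound on a Rademacher process by
a Gaussian one,

\[ \mathbb{E}\sup_{h \in H_\mu}|(P_n - P)(|\phi(h)|^q)| \le \frac{c_1}{\sqrt{n}}\,\mathbb{E}\,\mathbb{E}_g\sup_{h \in H_\mu}\Big|\frac{1}{\sqrt{n}}\sum_{i=1}^n g_i|\phi(h)(X_i,Y_i)|^q\Big|, \]

where $g_1,\ldots,g_n$ are $n$ independent standard Gaussian random variables and
$\mathbb{E}_g$ denotes the expectation with respect to those variables. For a fixed
sample $(X_i,Y_i)_{i=1}^n$, let $(Z(h))_{h \in H_\mu}$ be the Gaussian process defined by
$Z(h) = n^{-1/2}\sum_{i=1}^n g_i|\phi(h)(X_i,Y_i)|^q$. If $f,g \in F$, then

\[ \mathbb{E}_g(Z(h_f) - Z(h_g))^2 \le \frac{q^2}{n}\sum_{i=1}^n|f(X_i) - g(X_i)|^2\max\big(|\phi(h_f)(X_i,Y_i)|, |\phi(h_g)(X_i,Y_i)|\big)^{2q-2} \]
\[ \le 2q^2\max_{1 \le i \le n}(f(X_i) - g(X_i))^2D_n^{2q-2}, \]

where we have used that $\big||\phi(u)|^q - |\phi(v)|^q\big| \le q|u - v|\max(|\phi(u)|,|\phi(v)|)^{q-1}$
for every $u,v \in \mathbb{R}$. By a standard chaining argument it follows that

\[ \mathbb{E}_g\sup_{h \in H_\mu}\Big|\frac{1}{\sqrt{n}}\sum_{i=1}^n g_i|\phi(h)(X_i,Y_i)|^q\Big| \le c_2q\,\gamma_2(P_\sigma\widetilde{F^{(\mu)}}, \ell_\infty^n)\,D_n^{q-1}, \tag{2.7} \]

and thus, $\mathbb{E}\sup_{h \in H_\mu}|(P_n - P)(|\phi(h)|^q)| \le c_2q\sqrt{U_n/n}\,\sqrt{\mathbb{E}D_n^{2q-2}}$.

A bound on the diameter follows from (2.7) and the contraction principle,

\[ \mathbb{E}D_n^{2q-2} \le \mathbb{E}\sup_{h \in H_\mu}|(P_n - P)(|\phi(h)|^{2q-2})| + \sup_{h \in H_\mu}P|\phi(h)|^{2q-2} \]
\[ \le \frac{c_1}{\sqrt{n}}\,\mathbb{E}\,\mathbb{E}_g\sup_{h \in H_\mu}\Big|\frac{1}{\sqrt{n}}\sum_{i=1}^n g_i|\phi(h)(X_i,Y_i)|^{2q-2}\Big| + \theta^{q-2}\mu \]
\[ \le c_2q\,\theta^{q-2}\sqrt{\frac{U_n}{n}}\sqrt{\mathbb{E}D_n^{2q-2}} + \theta^{q-2}\mu, \]

implying that $\mathbb{E}D_n^{2q-2} \le c_3\max(q^2\theta^{2q-4}U_n/n,\ \theta^{q-2}\mu)$ and so

\[ \mathbb{E}\sup_{h \in H_\mu}|(P_n - P)(|\phi(h)|^q)| \le c_4q\max\Big(\sqrt{\frac{\mu U_n\theta^{q-2}}{n}},\ \frac{qU_n\theta^{q-2}}{n}\Big). \tag{2.8} \]

Next, observe that for $q = 2$, the right-hand side in (2.8) does not depend
on the truncation level $\theta$, and thus one may take $\theta$ arbitrarily large, leading
to the desired result.

For $q \ne 2$, consider the unbounded part of the process. Since the envelope
function of $(\ell_F^{(q)})_\mu$ exhibits a subexponential decay, then

\[ \mathbb{E}\Big(\sup_{h \in H_\mu}|h|^q1_{|h|>\theta}\Big) = \int_0^\infty\mathbb{P}\Big[\sup_{h \in H_\mu}|h|^q1_{|h|>\theta} > t\Big]dt \]
\[ = \theta^q\,\mathbb{P}\Big[\sup_{h \in H_\mu}|h| > \theta\Big] + \int_{\theta^q}^\infty\mathbb{P}\Big[\sup_{h \in H_\mu}|h|^q > t\Big]dt \]
\[ \le 2\theta^q\exp(-\theta^q/M) + 2M\exp(-\theta^q/M). \]

The result follows by taking $\theta^q = M\log n$. □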
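The three regimes in Proposition 2.5 can be compared numerically with a short Python sketch. The function below evaluates the maximum in the statement as read from the proof (truncation level $\theta^q = M\log n$); the name `prop_2_5_bound` is illustrative and `c0=1.0` is a placeholder, since the constant depending on $q$ is not made explicit in the text.

```python
import math


def prop_2_5_bound(mu, U_n, M, n, q, c0=1.0):
    """Evaluate the upper bound of Proposition 2.5 up to the unspecified constant c0.

    mu  : localization level
    U_n : complexity term E gamma_2^2 of the projected class
    M   : psi_1 norm of the envelope of the localized loss class
    """
    if q < 2 or n < 2:
        raise ValueError("need q >= 2 and n >= 2")
    root = math.sqrt(mu * U_n / n)
    if q == 2:
        # No truncation is needed for the square loss.
        return c0 * max(root, U_n / n)
    t = M * math.log(n)  # theta^q = M log n
    return c0 * max(
        root * t ** ((q - 2) / (2 * q)),
        (U_n / n) * t ** ((q - 2) / q),
        t / n,
    )


if __name__ == "__main__":
    for q in (2, 3, 4):
        print(q, prop_2_5_bound(mu=1.0, U_n=4.0, M=1.0, n=100, q=q))
```

For fixed $\mu$ the first term dominates, while at the scale $\mu \sim U_n/n$ all terms are of order $1/n$ up to logarithmic factors, which is what drives the fast residual terms.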



3. Proof of Theorem A. In this section, we will present the proof of 
Theorem A, which follows the same ideas as [4, 5] for the excess loss. 



Lemma 3.1. There exists an absolute constant $c_0 > 0$ for which the
following holds. Let $F$ be a class of functions, and assume that there is some $B_n$
such that for every $f \in F$, $P\ell_f^2 \le B_nP\ell_f + B_n^2/n$. For $x > 0$ and $0 < \varepsilon < 1/2$,
consider an event $\Omega_0(x)$ on which for every $f \in F$,

\[ R(f) \le (1+2\varepsilon)R_n(f) + \rho_n(x), \]

where $\rho_n(\cdot)$ is some fixed increasing function. Then, with probability greater
than $\mathbb{P}(\Omega_0(x)) - 4\exp(-x)$,

\[ R(\hat{f}_n^{ERM}) \le (1+3\varepsilon)\inf_{f \in F}\Big(R(f) + c_0\frac{(b_n(\ell_f) + B_n)(x+1)}{n\varepsilon}\Big) + \rho_n(x). \]

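Proof. Fix $x > 0$, let $K'$ be the constant introduced in (2.1), consider

\[ f^* \in \mathop{\mathrm{Argmin}}_{f \in F}\Big(R(f) + 15K'\frac{(b_n(\ell_f) + B_n)(x+1)}{n\varepsilon}\Big), \]

and without loss of generality one may assume that the infimum is achieved.
By (2.1) [for $\alpha = (\varepsilon/2)/(1+2\varepsilon)$], the event $\Omega^*(x)$ on which

\[ R_n(f^*) \le \frac{1+3\varepsilon}{1+2\varepsilon}R(f^*) + 5K'\frac{(b_n(\ell_{f^*}) + B_n)(x+1)}{n\varepsilon} \]

has probability greater than $1 - 4\exp(-x)$. Hence,

\[ -(1+3\varepsilon)R(f^*) \le -(1+2\varepsilon)R_n(f^*) + 15K'\frac{(b_n(\ell_{f^*}) + B_n)(x+1)}{n\varepsilon}, \]

and on $\Omega_0(x) \cap \Omega^*(x)$, every $f$ in $F$ satisfies that

\[ R(f) - (1+3\varepsilon)R(f^*) \le (1+2\varepsilon)(R_n(f) - R_n(f^*)) + \rho_n(x) + 15K'\frac{(b_n(\ell_{f^*}) + B_n)(x+1)}{n\varepsilon}. \]

Since $R_n(\hat{f}_n^{ERM}) - R_n(f^*) \le 0$, then

\[ R(\hat{f}_n^{ERM}) \le (1+3\varepsilon)R(f^*) + 15K'\frac{(b_n(\ell_{f^*}) + B_n)(x+1)}{n\varepsilon} + \rho_n(x), \]

and the claim now follows from the choice of $f^*$. □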

Proof of Theorem A. Let $x > 0$, $0 < \varepsilon < 1/2$, and put

\[ \rho_n(x) = \max\Big(\lambda_\varepsilon^*,\ \frac{\big((6K/\varepsilon)^2B_n + (4K/\varepsilon)b_n(\ell_F)\big)(x+1)}{n}\Big). \]

By Theorem 2.2, the event $\Omega_0(x)$, on which every $f \in F$ satisfies that

\[ R(f) \le (1+2\varepsilon)R_n(f) + \rho_n(x), \]

has probability greater than $1 - 4\exp(-x)$. Now, the result follows from
Lemma 3.1.

The remark following Theorem A, that if $\ell$ is nonnegative, then $\ell_F$ satisfies
a Bernstein-type condition with $B_n \sim \mathrm{diam}(\ell_F, \psi_1)\log(en)$, follows from
Lemma 2.3. □

4. Proof of Theorem B. Although the proof of Theorem B seems rather
technical, the idea behind it is rather simple. First, one needs to find a
"trivial" bound on $\mathrm{crit}(\hat{f}_n^{RERM})$, giving preliminary information on where one
must look for the RERM function (this is the role played by the function $\alpha_n$).
Then, one combines peeling and fixed point arguments to identify the exact
location of the RERM.

Note that for $F = \bigcup_{r \ge 0}F_r$, we have $\mathrm{crit}(f) = \infty$ for all $f \in \mathcal{F} \setminus F$.
Therefore, without loss of generality, we can replace the set $\mathcal{F}$ by $F$ in both the
definition of the RERM in (1.4) and in the nonexact regularized oracle
inequality of Theorem B.

We begin with the following rough estimate on the criterion of the RERM.
In the case where there is a trivial bound $\mathrm{crit}(f) \le C_n$ for all $f \in F$,
it follows that for any $0 < \varepsilon < 1/2$ and $x > 0$, $\mathrm{crit}(\hat{f}_n^{RERM}) \le C_n = \alpha_n(\varepsilon,x)$.
Turning to the second case stated in Assumption 1.1, recall that $r \mapsto \lambda_\varepsilon^*(r)$
tends to infinity with $r$ and there exists $K_1 > 0$ such that for every
$(r,x) \in \mathbb{R}_+ \times \mathbb{R}_+^*$, $2\rho_n(r,x) \le \rho_n(K_1(r+1),x)$. Hence, for every $x > 0$ and
$0 < \varepsilon < 1/2$, we set $\alpha_n$ to satisfy that

\[ \alpha_n(\varepsilon,x) \ge \max\Big[K_1(\mathrm{crit}(f_0) + 2),\ (\lambda_\varepsilon^*)^{-1}\Big((1+2\varepsilon)\Big(3R(f_0) + 2K'\big(b_n(\ell_{f_0}) + B_n(\mathrm{crit}(f_0))\big)\frac{x+1}{n}\Big)\Big)\Big], \]

where $f_0$ is any fixed function in $F$ (e.g., when $0 \in F$, one may take $f_0 = 0$),
and $(\lambda_\varepsilon^*)^{-1}$ is the generalized inverse function of $\lambda_\varepsilon^*$. In this case, we prove
the following high probability bound on $\mathrm{crit}(\hat{f}_n^{RERM})$.

Lemma 4.1. Assume that $r \mapsto \lambda_\varepsilon^*(r)$ tends to infinity when $r$ tends to
infinity and that there exists $K_1 > 0$ such that for every $(r,x) \in \mathbb{R}_+ \times \mathbb{R}_+^*$,
$2\rho_n(r,x) \le \rho_n(K_1(r+1),x)$. Then, under the assumptions of Theorem B,
for every $x > 0$ and $0 < \varepsilon < 1/2$, with probability greater than $1 - 4\exp(-x)$,
$\mathrm{crit}(\hat{f}_n^{RERM}) \le \alpha_n(\varepsilon,x)$.

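Proof. By the definition of $\hat{f}_n^{RERM}$,

\[ R_n(\hat{f}_n^{RERM}) + \frac{2}{1+2\varepsilon}\rho_n(\mathrm{crit}(\hat{f}_n^{RERM}) + 1, x + \log\alpha_n(\varepsilon,x)) \le R_n(f_0) + \frac{2}{1+2\varepsilon}\rho_n(\mathrm{crit}(f_0) + 1, x + \log\alpha_n(\varepsilon,x)). \]

Since $\ell$ is nonnegative, then $R_n(\hat{f}_n^{RERM}) \ge 0$, and thus

\[ \rho_n(\mathrm{crit}(\hat{f}_n^{RERM}) + 1, x + \log\alpha_n(\varepsilon,x)) \le (1+2\varepsilon)R_n(f_0)/2 + \rho_n(\mathrm{crit}(f_0) + 1, x + \log\alpha_n(\varepsilon,x)) \]
\[ \le \max\big((1+2\varepsilon)R_n(f_0),\ 2\rho_n(\mathrm{crit}(f_0) + 1, x + \log\alpha_n(\varepsilon,x))\big). \]

Since $\rho_n(r,x) \ge \lambda_\varepsilon^*(r)$ for all $r \ge 0$, one of the following two situations occurs:
either

\[ \lambda_\varepsilon^*(\mathrm{crit}(\hat{f}_n^{RERM})) \le (1+2\varepsilon)R_n(f_0) \]

or, noting that for every $(r,x) \in \mathbb{R}_+ \times \mathbb{R}_+^*$, $2\rho_n(r,x) \le \rho_n(K_1(r+1),x)$, then

\[ \rho_n(\mathrm{crit}(\hat{f}_n^{RERM}) + 1, x + \log\alpha_n(\varepsilon,x)) \le 2\rho_n(\mathrm{crit}(f_0) + 1, x + \log\alpha_n(\varepsilon,x)) \le \rho_n(K_1(\mathrm{crit}(f_0) + 2), x + \log\alpha_n(\varepsilon,x)), \]

and since $\rho_n$ is monotone in $r$, then $\mathrm{crit}(\hat{f}_n^{RERM}) \le K_1(\mathrm{crit}(f_0) + 2)$.
Hence, in both cases

\[ \mathrm{crit}(\hat{f}_n^{RERM}) \le \max\big((\lambda_\varepsilon^*)^{-1}((1+2\varepsilon)R_n(f_0)),\ K_1(\mathrm{crit}(f_0) + 2)\big). \tag{4.1} \]

On the other hand, according to (2.1), with probability greater than
$1 - 4\exp(-x)$, $R_n(f_0) \le 3R(f_0) + 2K'(b_n(\ell_{f_0}) + B_n(\mathrm{crit}(f_0)))(x+1)/n$. The
result follows by plugging the last inequality in (4.1) and since $\lambda_\varepsilon^*$ is
nondecreasing. □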

The next step is to find an "isomorphic" result for $\hat{f}_n^{RERM}$. The idea is
to divide the set given by the trivial estimate on $\mathrm{crit}(\hat{f}_n^{RERM})$ into level sets
and analyze each piece separately.

Lemma 4.2. Under the assumptions of Theorem B, for every $x > 0$, with
probability greater than $1 - 8\exp(-x)$,

\[ R(\hat{f}_n^{RERM}) \le (1+2\varepsilon)R_n(\hat{f}_n^{RERM}) + \rho_n(\mathrm{crit}(\hat{f}_n^{RERM}) + 1, x + \log\alpha_n(\varepsilon,x)). \]
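Proof. Let $\Omega_0(x)$ be the event

\[ \frac{R(\hat{f}_n^{RERM}) - R_n(\hat{f}_n^{RERM})}{2\varepsilon R_n(\hat{f}_n^{RERM}) + \rho_n(\mathrm{crit}(\hat{f}_n^{RERM}) + 1, x + \log\alpha_n(\varepsilon,x))} > 1, \]

and we will show that this event has the desired small probability.
Clearly,

\[ \mathbb{P}[\Omega_0(x)] \le \mathbb{P}[\Omega_0(x) \cap \{\mathrm{crit}(\hat{f}_n^{RERM}) \le \alpha_n(\varepsilon,x)\}] + \mathbb{P}[\mathrm{crit}(\hat{f}_n^{RERM}) > \alpha_n(\varepsilon,x)], \]

and by Lemma 4.1, $\mathbb{P}[\mathrm{crit}(\hat{f}_n^{RERM}) > \alpha_n(\varepsilon,x)] \le 4\exp(-x)$ in the second case
of Assumption 1.1, or $\mathbb{P}[\mathrm{crit}(\hat{f}_n^{RERM}) > \alpha_n(\varepsilon,x)] = 0$ when there is a trivial
bound on the criterion. Therefore, in any case, we have
$\mathbb{P}[\mathrm{crit}(\hat{f}_n^{RERM}) > \alpha_n(\varepsilon,x)] \le 4\exp(-x)$.

Recall that $F_i = \{f \in F : \mathrm{crit}(f) \le i\}$ for all $i \in \mathbb{N}$, and since $\rho_n$ is
monotone in $r$, then

\[ \mathbb{P}[\Omega_0(x) \cap \{\mathrm{crit}(\hat{f}_n^{RERM}) \le \alpha_n(\varepsilon,x)\}] \le \sum_{i=0}^{\lfloor\alpha_n(\varepsilon,x)\rfloor}\mathbb{P}[\Omega_0(x) \cap \{i \le \mathrm{crit}(\hat{f}_n^{RERM}) \le i+1\}] \]
\[ \le \sum_{i=0}^{\lfloor\alpha_n(\varepsilon,x)\rfloor}\mathbb{P}[\exists f \in F_{i+1} : R(f) > (1+2\varepsilon)R_n(f) + \rho_n(i+1, x + \log\alpha_n(\varepsilon,x))]. \]

By Theorem 2.2, for every $t > 0$ and $i \in \mathbb{N}$, with probability greater than
$1 - 4\exp(-t)$, for every $f \in F_{i+1}$, $P\ell_f \le (1+2\varepsilon)P_n\ell_f + \rho_n(i+1,t)$. In
particular,

\[ \mathbb{P}[\exists f \in F_{i+1} : R(f) > (1+2\varepsilon)R_n(f) + \rho_n(i+1, x + \log\alpha_n(\varepsilon,x))] \le 4\exp(-(x + \log\alpha_n(\varepsilon,x))). \]

Hence, the claim follows, since

\[ \mathbb{P}[\Omega_0(x) \cap \{\mathrm{crit}(\hat{f}_n^{RERM}) \le \alpha_n(\varepsilon,x)\}] \le \sum_{i=0}^{\lfloor\alpha_n(\varepsilon,x)\rfloor}4\exp(-(x + \log\alpha_n(\varepsilon,x))) \le 4\exp(-x). \]

□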

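Proof of Theorem B. Let $x > 0$ and $0 < \varepsilon < 1$. Without loss of
generality, we assume that, for the constant $K'$ defined in (2.1), there exists
$f^* \in F$ minimizing the function

\[ f \in F \mapsto (1+3\varepsilon)R(f) + 2\rho_n(\mathrm{crit}(f) + 1, x + \log\alpha_n(\varepsilon,x)) + 6K'\frac{(b_n(\ell_f) + B_n(\mathrm{crit}(f)))(x+1)}{\varepsilon n}. \]

Let $\Omega^*(x)$ be the event on which

\[ R_n(f^*) \le \frac{1+3\varepsilon}{1+2\varepsilon}R(f^*) + \frac{6K'}{1+2\varepsilon}\cdot\frac{(b_n(\ell_{f^*}) + B_n(\mathrm{crit}(f^*)))(x+1)}{\varepsilon n}. \]

Since $f^* \in F_{\mathrm{crit}(f^*)}$, then $P\ell_{f^*}^2 \le B_n(\mathrm{crit}(f^*))P\ell_{f^*} + B_n^2(\mathrm{crit}(f^*))/n$, and
by (2.1) [applied with $\alpha = (\varepsilon/2)/(1+2\varepsilon)$], $\mathbb{P}(\Omega^*(x)) \ge 1 - 4\exp(-x)$.
Consider the event $\Omega_0(x)$, on which

\[ R(\hat{f}_n^{RERM}) \le (1+2\varepsilon)R_n(\hat{f}_n^{RERM}) + \rho_n(\mathrm{crit}(\hat{f}_n^{RERM}) + 1, x + \log\alpha_n(\varepsilon,x)), \]

and observe that by Lemma 4.2, $\mathbb{P}[\Omega_0(x)] \ge 1 - 8\exp(-x)$. Therefore, on
$\Omega_0(x) \cap \Omega^*(x)$, writing $\bar{x} = x + \log\alpha_n(\varepsilon,x)$, we have

\[ R(\hat{f}_n^{RERM}) + \rho_n(\mathrm{crit}(\hat{f}_n^{RERM}) + 1, \bar{x}) - (1+3\varepsilon)R(f^*) \]
\[ \le (1+2\varepsilon)\big(R_n(\hat{f}_n^{RERM}) - R_n(f^*)\big) + 2\rho_n(\mathrm{crit}(\hat{f}_n^{RERM}) + 1, \bar{x}) + 6K'\frac{(b_n(\ell_{f^*}) + B_n(\mathrm{crit}(f^*)))(x+1)}{\varepsilon n} \]
\[ \le (1+2\varepsilon)\Big(R_n(\hat{f}_n^{RERM}) + \frac{2}{1+2\varepsilon}\rho_n(\mathrm{crit}(\hat{f}_n^{RERM}) + 1, \bar{x}) - R_n(f^*) - \frac{2}{1+2\varepsilon}\rho_n(\mathrm{crit}(f^*) + 1, \bar{x})\Big) \]
\[ \qquad + 2\rho_n(\mathrm{crit}(f^*) + 1, \bar{x}) + 6K'\frac{(b_n(\ell_{f^*}) + B_n(\mathrm{crit}(f^*)))(x+1)}{\varepsilon n} \]
\[ \le 2\rho_n(\mathrm{crit}(f^*) + 1, \bar{x}) + 6K'\frac{(b_n(\ell_{f^*}) + B_n(\mathrm{crit}(f^*)))(x+1)}{\varepsilon n}, \]

where the last inequality follows from the definition of $\hat{f}_n^{RERM}$. Hence, by
the choice of $f^*$, it follows that on $\Omega_0(x) \cap \Omega^*(x)$,

\[ R(\hat{f}_n^{RERM}) + \rho_n(\mathrm{crit}(\hat{f}_n^{RERM}) + 1, \bar{x}) \le (1+3\varepsilon)R(f^*) + 2\rho_n(\mathrm{crit}(f^*) + 1, \bar{x}) + 6K'\frac{(b_n(\ell_{f^*}) + B_n(\mathrm{crit}(f^*)))(x+1)}{\varepsilon n} \]
\[ = \inf_{f \in F}\Big((1+3\varepsilon)R(f) + 2\rho_n(\mathrm{crit}(f) + 1, \bar{x}) + 6K'\frac{(b_n(\ell_f) + B_n(\mathrm{crit}(f)))(x+1)}{\varepsilon n}\Big). \]

□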

5. Proofs of Theorem C. Theorem C follows from a direct application
of Theorem B, by estimating the specific function $\rho_n$ and the "Bernstein
function" $B_n(r)$.

Consider the family of models $(F_r)_{r \ge 0}$ associated with the $\ell_1$-criterion,
$F_r = \{f_\beta : \|\beta\|_1 \le r\}$, where $f_\beta(x) = \langle x,\beta\rangle$ is a linear functional on $\mathbb{R}^d$.

Lemma 5.1. There exists an absolute constant $c_0$ for which the following
holds. For every $\mu$ and $r \ge 0$, and every $\sigma = (X_1,\ldots,X_n)$,

72 (^ r ,C)<cor(max||X i ||^)(logd)logf-^| 

Moreover, if $\|\|X\|_{\ell_\infty^d}\|_{\psi_2} \le K(d)$, then

\[ \big(\mathbb{E}\gamma_2^2(P_\sigma F_r, \ell_\infty^n)\big)^{1/2} \le c_0\,r\,K(d)(\log n)^{3/2}(\log d). \]

The proof of the first part of the claim is rather standard and has appeared
in one form or another in several places; for example, see [5]. It follows
from (2.6) and Maurey's empirical method (cf. [9, 36]). The second part is
an immediate corollary of the first one.



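Proof of Theorem C. Observe that for every $\beta \in rB_1^d$,

\[ \||Y - \langle X,\beta\rangle|^q\|_{\psi_1} = \|Y - \langle X,\beta\rangle\|_{\psi_q}^q \le \big(\|Y\|_{\psi_q} + \|\langle X,\beta\rangle\|_{\psi_q}\big)^q. \]

Hence, by Lemma 2.3, one may take $B_n(r) = c_0(2K(d))^q(1+r)^q\log(en)$.

Next, the $\psi_1$-norm of the envelope of the class $\ell_{F_r}$ satisfies
$\|\sup_{\beta \in rB_1^d}|Y - \langle X,\beta\rangle|^q\|_{\psi_1} \le (K(d))^q(1+r)^q$, and by (2.5), Proposition 2.5 and
Lemma 5.1, for every $\lambda > 0$,

\[ \mathbb{E}\|P - P_n\|_{V(\ell_{F_r})_\lambda} \le \sum_{i=0}^\infty 2^{-i}\,\mathbb{E}\|P - P_n\|_{(\ell_{F_r})_{2^{i+1}\lambda}} \]
\[ \le c\sum_{i=0}^\infty 2^{-i}\max\Big(\sqrt{\frac{2^{i+1}\lambda\,r^2(1+r)^{q-2}h(n,d)}{n}},\ \frac{r^2(1+r)^{q-2}h(n,d)}{n},\ \frac{K(d)^q(1+r)^q(\log n)}{n}\Big) \]
\[ \le c_1\max\Big(\sqrt{\frac{\lambda(1+r)^qh(n,d)}{n}},\ \frac{(1+r)^qh(n,d)}{n}\Big), \]

where $h(n,d) = K(d)^q(\log n)^{(4q-2)/q}(\log d)^2$. Set $\lambda_\varepsilon^*(r) = c_2(1+r)^qh(n,d)/(n\varepsilon^2)$
and observe that $\mathbb{E}\|P - P_n\|_{V(\ell_{F_r})_{\lambda_\varepsilon^*(r)}} \le (\varepsilon/4)\lambda_\varepsilon^*(r)$. Moreover, since

\[ b_n(\ell_{F_r}) = \Big\|\max_{1 \le i \le n}\sup_{f \in F_r}\ell_f(X_i,Y_i)\Big\|_{\psi_1} \le c_3(\log en)\Big\|\sup_{f \in F_r}\ell_f(X,Y)\Big\|_{\psi_1}, \]

then one can take $\phi_n(r) = c_3K(d)^q(\log n)(1+r)^q$. Thus

\[ \rho_n(r,x) = c_4\frac{h(n,d)(1+r^q)}{n\varepsilon^2}(1+x) \]

is a valid isomorphic function for this problem. It is also easy to check that
for $f_0 = 0$, $\log\alpha_n(\varepsilon,x) \le c_5\log(\max(x,n)\|Y\|_{\psi_q})$. The result now follows by
combining these estimates with Theorem B. □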
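The quantities appearing in the proof of Theorem C can be tabulated with the short Python sketch below, which evaluates $h(n,d)$, the fixed point $\lambda_\varepsilon^*(r)$ and the isomorphic function $\rho_n(r,x)$. The function names are illustrative, and the constants `c2` and `c4` default to 1.0 purely as placeholders, since the text does not give their values.

```python
import math


def h(n, d, K_d, q):
    """h(n, d) = K(d)^q (log n)^{(4q-2)/q} (log d)^2."""
    return K_d ** q * math.log(n) ** ((4 * q - 2) / q) * math.log(d) ** 2


def lambda_star(r, n, d, K_d, q, eps, c2=1.0):
    """Fixed point lambda*_eps(r) = c2 (1+r)^q h(n,d) / (n eps^2); c2 is a placeholder."""
    return c2 * (1 + r) ** q * h(n, d, K_d, q) / (n * eps ** 2)


def rho_n(r, x, n, d, K_d, q, eps, c4=1.0):
    """Isomorphic function rho_n(r, x) = c4 h(n,d) (1 + r^q) (1 + x) / (n eps^2); c4 is a placeholder."""
    return c4 * h(n, d, K_d, q) * (1 + r ** q) * (1 + x) / (n * eps ** 2)


if __name__ == "__main__":
    # The residual term decays like 1/n up to logarithmic factors in n and d.
    for n in (10**3, 10**4, 10**5):
        print(n, lambda_star(r=1.0, n=n, d=500, K_d=1.0, q=2, eps=0.25))
```

Multiplying $n$ by 10 divides the printed values by roughly 10 (up to the growth of the logarithmic factor), which is the $1/n$ residual behavior claimed for the $\ell_1$-regularized procedure.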

6. Remarks on the differences between exact and nonexact oracle in- 
equalities. The goal of this section is to describe the difference between 
the analysis used in [4] to obtain exact oracle inequalities for the ERM, and 
the one used in this note to establish nonexact oracle inequalities for the
ERM (Theorem A). Our aim is to indicate why one may get faster rates for 
nonexact inequalities than for exact ones for the same problem. 

One should stress that this is not, by any means, a proof that it is im- 
possible to get exact oracle inequalities with fast rates (there are in fact ex- 
amples in which the ERM satisfies exact oracle inequalities with fast rates: 
the linear aggregation problem, [12]). It is not even a proof that the lo- 
calization method presented here is sharp. A detailed study of the isomor- 
phic method and oracle inequalities for a general sub-Gaussian case (i.e., 
a sub-exponential squared loss), in the sense that the class F has a bounded 
diameter in $L_{\psi_2}$ rather than an envelope function, will be presented in [27].

However, we believe that this explanation will help to shed some light on 
the differences between the two types of inequalities, and we refer the reader 
to [27] for a more detailed and accurate analysis. 

Our starting point is the following exact oracle inequality for ERM, which 
is a mild modification of a result from [4]. The only difference is that it uses 
Adamczak's $\psi_1$ version of Talagrand's concentration inequality for empirical
processes, instead of Massart's version. 

Theorem 6.1. There exists an absolute constant $c_0 > 0$ for which the
following holds. Let $F$ be a class of functions and assume that there exists
$B > 0$ such that for every $f \in F$, $P\mathcal{L}_f^2 \le BP\mathcal{L}_f$. Let $\mu^* > 0$ be such that
E||P n — P\\v(c F ) * — an d consider an increasing function p n which 

satisfies that, for every $x > 0$, $\rho_n(x) \ge \max(\mu^*, c_0(b_n(\mathcal{L}_F) + B)x/n)$. Then,
for every $x > 0$, with probability greater than $1 - 8\exp(-x)$, the risk of the
ERM satisfies $R(\hat{f}_n^{ERM}) \le \inf_{f \in F}R(f) + \rho_n(x)$.

Roughly put, and as indicated by the theorem, localization arguments are 
based on two main components: 

(1) A Bernstein-type condition, the essence of which is that it allows 
one to "translate" localization with respect to the loss or the excess loss to 
a localization with respect to a natural metric. In particular this leads to 
the necessary control on the diameter of a random coordinate projection 
of the localized class. 

(2) The fixed point of the empirical process indexed by the localized star- 
shaped hull of the loss functions class (for nonexact inequalities) or of the 
excess loss functions class (for exact ones). 

Although the two components seem similar for the exact and nonexact 
cases, they are very different. Indeed, for a nonexact oracle inequality, the 
Bernstein type condition is almost trivially satisfied and requires no special 
properties on the model/output couple (F, Y) — as long as the functions in- 
volved have well behaved tails. As such, it is an individual property of every 
class member; see Lemma 2.3. 
On the other hand, the Bernstein condition required for the exact oracle
inequality is deeply connected to the geometry of the problem; see, for
example, [30]. More accurately, when the target $Y$ is far from the set of multiple
minimizers of the risk,

\[ N(F,\ell,X) = \{Y : |\{f \in F : R(f) = \min_{f \in F}R(f)\}| \ge 2\}, \]

one can show that a Bernstein condition holds for a large variety of loss
functions $\ell$. However, when the target $Y$ gets closer to the set $N(F,\ell,X)$, the
Bernstein constant $B$ degenerates, and leads to rates slower than $1/\sqrt{n}$ even
if $F$ is a two functions class. Hence, the geometry of the problem (the
relative position of $Y$ and $F$) is very important when trying to establish exact
oracle inequalities, and the Bernstein condition is truly a "global" property
of $F$.

In particular, this explains the gap that we observed in the example
preceding the formulation of Theorem A. In that case, the class is a finite set of
functions and the set $N(F,\ell,X)$ is nonempty. Thus, one can find a set $F$ and
a target $Y$ in a "bad" position, leading to an excess loss class $\mathcal{L}_F$ with a
trivial Bernstein constant (i.e., greater than $\sqrt{n}$). On the other hand, regardless
of the choice of $Y$, the Bernstein constant of $\ell_F$ is well behaved.

Let us mention that when the gap between exact and nonexact oracle
inequalities is only due to the Bernstein condition, it is likely that both
ERM and RERM will be suboptimal procedures [19, 28, 44]. In particular,
when slow rates are due to a lack of convexity of $F$ (which is closely related
to a bad Bernstein constant of $\mathcal{L}_F$), one can consider procedures which
"improve the geometry" of the model (e.g., the "starification" method of [2]
or the "pre-selection-convexification" method in [18]).

The second aspect of the problem is the fixed point of the localized
empirical process. Although the complexity of the sets $\mathcal{L}_F$ and $\ell_F$ seems similar
from a metric point of view ($\mathcal{L}_F$ is just a shift of $\ell_F$), the localized
star-shaped hulls $V(\mathcal{L}_F)_\lambda$ and $V(\ell_F)_\lambda$ are rather different. Since there are many ways
of bounding the empirical process indexed by these localized sets, let us show
the difference for one of the methods, based on the random geometry of the
classes, and for the sake of simplicity, we will only consider the square loss.
With this method of analysis at hand, the dominant term of the bound on
$\mathbb{E}\|P - P_n\|_{(\ell_F^{(2)})_\lambda}$ (for the loss class) which was obtained in Proposition 2.5 is

\[ \sqrt{\frac{\lambda\,\mathbb{E}\gamma_2^2(P_\sigma\widetilde{F^{(\lambda)}}, \ell_\infty^n)}{n}}. \tag{6.1} \]

A similar bound was obtained for $\mathbb{E}\|P - P_n\|_{V(\mathcal{L}_F)_\lambda}$ in [32] and [5], in which
the dominant term is

(6.2)
If this bound is sharp (and it is in many cases), and since $R^* = \inf_{f \in F}R(f)$
is in general a nonzero constant, the fixed point $\mu^*$ of Theorem 6.1 is
of the order of $\sqrt{\mathbb{E}\gamma_2^2(P_\sigma\widetilde{F^{(\mu^*)}}, \ell_\infty^n)/n}$ and thus leads to a rate decaying
more slowly than $1/\sqrt{n}$. In contrast, in the nonexact case one has
$\lambda^* \sim \mathbb{E}\gamma_2^2(P_\sigma\widetilde{F^{(\lambda^*)}}, \ell_\infty^n)/n$, which is of the order of $1/n$ (up to logarithmic factors)
when the complexity $\mathbb{E}\gamma_2^2(P_\sigma F, \ell_\infty^n)$ is "reasonable."
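The contrast between the two fixed points can be illustrated numerically. The sketch below solves `bound(lam) <= c * lam` by geometric bisection for two bounds: $\sqrt{\lambda U/n}$, which mirrors the dominant term for the loss class, and $\sqrt{(\lambda + R^*)U/n}$, which is an assumed stand-in for the excess loss class, chosen only because it reproduces the orders of magnitude stated in the text; it is not taken from [32] or [5]. Here `U` plays the role of the complexity $\mathbb{E}\gamma_2^2$, treated as a constant, and the names `fixed_point`, `U`, `R_star` and `c` are illustrative.

```python
import math


def fixed_point(bound, c, lo=1e-12, hi=1e12, iters=200):
    """Smallest lam in [lo, hi] with bound(lam) <= c * lam.

    Assumes bound(lam) / lam is nonincreasing, so that the admissible set is
    a half-line, and that the condition fails at lo and holds at hi.
    """
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisection on a logarithmic scale
        if bound(mid) <= c * mid:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    U, R_star, c = 1.0, 1.0, 0.25
    for n in (10**2, 10**4, 10**6):
        nonexact = fixed_point(lambda lam: math.sqrt(lam * U / n), c)
        exact = fixed_point(lambda lam: math.sqrt((lam + R_star) * U / n), c)
        print(n, nonexact, exact)
```

In this toy setting the first fixed point equals $U/(c^2 n)$ and scales like $1/n$, while the second behaves like $\sqrt{R^* U/n}/c$ for large $n$, matching the $1/n$ versus $1/\sqrt{n}$ dichotomy described above.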

The reason for this gap comes from the observation that functions in the
star hull of $\ell_F$ whose expectation is smaller than $R^*$ are only "scaled down"
versions of functions from $\ell_F$. In fact, the "complexity" of the localized
sets below the level of $R^*$ can already be seen at the level $R^*$. Hence, the
empirical process those sets index (when scaled properly) becomes smaller
with $\lambda$.

In contrast, because there are functions in $\mathcal{L}_F$ that can have an arbitrarily
small expectation, the complexity of the localized subsets of the star hull
of $\mathcal{L}_F$ (normalized properly, of course) can even increase as $\lambda$ decreases. This
happens in very simple situations; for example, even in regression relative
to $B_1^d$, if $R^* \ne 0$, the complexity of the localized sets remains almost stable
and starts to decrease only at a very "low" level $\lambda$. This is the reason for
the phase transition in the error rate ($\sim \max\{\sqrt{(\log d)/n},\ d/n\}$) that one
encounters in that problem. The first term is due to the fact that the
complexity of the localized sets does not change as $\lambda$ decreases, up to some
critical level, while the second captures what happens when the localized
sets begin to "shrink." A concrete example of this phenomenon is treated in
the Supplementary material [20] in the Convex aggregation context.



Applications to matrix completion, convex aggregation and model selec- 
tion (DOI: 10.1214/11-AOS965SUPP; .pdf). In the supplementary file, we 
apply our main results to the problem of matrix completion, convex aggrega- 
tion and model selection. The aim is to expose the fundamental differences 
between exact and nonexact oracle inequalities on classical problems. 